Talks and Presentations

Every now and again, I give public presentations. Even less frequently do I give talks that I think are great. Either way, here’s a collection of talks that I’m happy to have given.

Practical Kerberos with Apache HBase

Kerberos is the system that underpins the vast majority of strong authentication across the Apache HBase/Hadoop application stack. Kerberos errors have brought many to their knees, and Kerberos is often referred to as “black magic” or “the dark arts”, a long-standing joke about how few people understand how it works. This talk will cover the types of problems that Kerberos does and doesn’t solve for HBase, decrypt some jargon on the related libraries and technology that enable Kerberos authentication in HBase and Hadoop, and distill some basic takeaways designed to help users develop applications that can securely communicate with a “kerberized” HBase installation.

HBaseCon East 2016 (2016/09/26 - New York City, NY)

Effective Testing of Apache Accumulo Iterators

Apache Accumulo’s Iterators are a powerful API that developers leverage to efficiently perform operations like aggregations and filters, reducing the latency of these operations by orders of magnitude. However, Iterators are notoriously difficult to implement correctly. This talk will introduce an Iterator testing harness designed to improve code quality on newly created Iterators, catch common runtime pitfalls, and provide an end-to-end testing solution for Iterators.

Accumulo Summit 2016 (2016/10/11 - College Park, MD)

Apache HBase Internals You Hoped You Never Needed to Understand

Future of Data Meetup Group NYC (2016/10/11 - New York City, NY)

Phoenix + HBase: An Enterprise Grade Data-Warehouse Appliance for Interactive Analytics?

In this talk, we will present how HBase and Phoenix can become a company’s data-warehouse appliance for fast interactive analytics. We will share some experience of how companies currently use this appliance for their decision-support systems. We will present use cases that depend on being able to run hundreds of queries in parallel on fact tables with more than 100 billion rows, while also expecting really fast responses on individual queries. We will take a high-level look at the internals of Phoenix and HBase and at the relevant pieces that make those use cases possible. We will also go one level deeper, showing that the architecture presented is viable in terms of the features that enterprises expect: low setup and maintenance cost, high availability and scalability, disaster recovery support, security support, and others. We will cover how the Phoenix QueryServer enables companies to access and leverage Phoenix from new sources like Python, Ruby or C/C++. Finally, we will shed some light on how the Phoenix/HBase architecture fits in the data-warehouse space in terms of integration with other processing engines for ETL like MapReduce, Hive, Spark, and Pig, as well as providing user insight via BI tools over ODBC.

Co-Presented with Ankit Singhal and RajeshBabu Chintaguntla.

Hadoop Summit San Jose 2016 (2016/06/30 - San Jose, CA)

Apache Accumulo 1.8.0 Overview

A brief overview of Apache Accumulo 1.8.0.

Apache Accumulo Meetup Group at Hadoop Summit San Jose 2016 (2016/06/27 - San Jose, CA)

Apache Phoenix Query Server

An overview of the Apache Phoenix Query Server component and how it enables development of realtime Hadoop applications using Java, .NET, Python, ODBC and more.

PhoenixCon 2016 (2016/05/25 - San Francisco, CA)

Introduction to Apache Calcite

A brief introduction to Apache Calcite.

Washington DC, Maryland and Virginia Hortonworks User Group (2016/04/20 - Herndon, VA)

Demystifying the Apache Phoenix Query Server

A brief introduction to the Apache Phoenix Query Server.

Washington DC, Maryland and Virginia Hortonworks User Group (2016/04/13 - Baltimore, MD)

Designing and Testing Apache Accumulo Iterators

An overview of good Accumulo Iterator design and testing.

Apache Accumulo Meetup Group (2015/11/10 - Washington DC)

Alternatives to Apache Accumulo’s Java API

A common tradeoff made by fault-tolerant, distributed systems is the ease of user interaction with the system. Implementing correct distributed operations in the face of failures often takes priority over reducing the level of effort required to use the system. Because of this, applying the system to a problem in a specific domain can require significant planning and effort by the user. Apache Accumulo, with its sorted key-value data model, is subject to this same problem: it is often difficult to use Accumulo to quickly ascertain real-life answers about some concrete problem.

This problem, which is not unique to Accumulo, has spurred the growth of numerous projects that fill these kinds of usability gaps, in addition to multiple language bindings provided by applications. Outside of the Java API, Accumulo client support ranges from programming languages, like Python or Ruby, to standalone projects that provide their own query language, such as Apache Pig and Apache Hive. This talk will cover the state of client support outside of Accumulo’s Java API, with an emphasis on the pros, cons, and best practices of each alternative.

Accumulo Summit 2015 (2015/04/28 - College Park, MD)

Data Center Replication with Apache Accumulo

Apache Accumulo presently lacks the ability to automatically replicate its contents to another Accumulo instance with low latency. The only options currently available involve quiescing a table, exporting that table, copying it to the remote instance, and importing it. This is unacceptable for a few reasons, the most important being that the table must be made unavailable in order to export it. This talk will outline the problems in designing a low-latency replication system for Accumulo tables, describe an implementation that leverages some useful features of Accumulo, and outline future work in the area.

Accumulo Summit 2014 (2014/06/12 - College Park, MD)