The MapReduce system built on commodity hardware is the de facto standard for large scale data analysis. In this talk, we will describe its relationship to PRAMs and BSP, present new algorithms tailored specifically for this paradigm and conclude with some open questions.

### Monday, October 21st, 2013

LinkedIn is the largest professional network in the world with more than 225M members. It provides an effective channel to marketers interested in reaching out to affluent set of professionals throughout the world. In this talk I will provide an overview of various advertising models one can use on LinkedIn. Response prediction of a rare action like click is a fundamental input that is required to run these advertising systems in an automated fashion. This involves fitting large scale supervised learning algorithms to massive data consisting of hundreds of millions of observations and hundreds of thousands of predictors. I will describe ADMM algorithms and variations thereof that are easy to scale on readily available distributed computing infrastructures like Hadoop.

Graph theory is used to model large-scale complex systems in various domains, such as genomics and social network analysis. I will begin by describing a software stack for parallel analysis of very large graphs whose layered architecture combines performance, customizability, and conceptual simplicity though separation of concerns.

The Combinatorial BLAS backend implements state-of-the-art parallel algorithms for sparse matrix primitives that serve as building blocks for tightly-coupled graph computations. The Python based frontend, the Knowledge Discovery Toolbox (KDT), provides high-productivity and simpler graph abstractions. KDT’s analytic queries on filtered semantic graphs achieve high-performance through selective just-in-time embedded specialization that automatically translates KDT filters into efficiency layer. Filtering on attributed semantic graphs ena ble a broad set of real applications that maintain important auxiliary information through attributes on individual edges and vertices.

I will then present new parallel algorithms that reduce the communication costs of certain graph computations. The new parallel communication-avoiding algorithms are all-pairs shortest-paths, breadth-first search, and sparse matrix-matrix multiplication. Most graph algorithms have low computational intensity; hence their execution time is bounded by their communication requirements. In addition to improving the running time drastically, reducing communication can also help improve the energy consumption of graph algorithms.

I will give an overview of some of my work on dynamic asynchronous algorithms and systems for graph-parallel machine learning. I will begin by providing some background motivation on the need for structured probabilistic reasoning and some of the traditional tools and techniques used in this domain. I will then introduce our work on inference in probabilistic graphical models. In particular I will present our Splash BP and Splash Gibbs sampling algorithms and show how adaptive asynchronous parallelization can be used to accelerate convergence and preserve sequential statistical properties in the parallel and distributed settings. Motivated by our work on these algorithms I will provide an overview of GraphLab: a high-level abstraction and system designed to support the design and implementation of dynamic asynchronous graph-parallel algorithms.

The speed and scalability of algorithms for problems on large data-sets depends on their (i) locality—the amount of data the algorithms moves around, and (ii) degree of parallelism. Therefore, a cost model for quantifying complexity along these lines would be useful for a systematic study and comparison of algorithms.

This talk will present an overview of cost models for locality and parallelism including models that are program-centric, i.e., agnostic to the choice of parallel machine. We will discuss how costs in these models map to performance on parallel machines. To demonstrate their use and application, we will provide illustrative examples of analysis of algorithms (upper bounds) and lower bounds for problems.

We hope this talk will start a discussion on the suitability of different cost models for problems in the inference and optimization domains. Such a basis for comparison would help identify good algorithms, and highlight the necessity of new algorithms where optimal ones do not exist.

Standard multiclass classification techniques require O(k) computation during train and test where k is the number of classes, yet the information theoretic lower bound is O(log k). This gap matters little when k is on the order of 10 or 100, but at 10^4 or 10^6 it matters a great deal. I will discuss the theory of extreme multiclass classification including consistency, robustness, efficiency, and structure learning.

Training examples are not all equally informative. Active learning strategies leverage this observation in order to massively reduce the number of examples that need to be labeled. We leverage the same observation to build a generic strategy for parallelizing learning algorithms. This strategy is effective because the search for informative examples is highly parallelizable and because we show that its performance does not deteriorate when the sifting process relies on a slightly outdated model. Parallel active learning is particularly attractive to train nonlinear models with non-linear representations because there are few practical parallel learning algorithms for such models. We report preliminary experiments using both kernel SVMs and SGD-trained neural networks.

Machine learning (ML) and statistical techniques are crucial for transforming Big Data into actionable knowledge. However, the complexity of existing ML algorithms is often overwhelming. Many end-users do not understand the trade-offs and challenges of parameterizing and choosing between different learning techniques. Furthermore, existing scalable systems that support ML are typically not accessible to ML developers without a strong background in distributed systems and low-level primitives. In this talk I will provide an overview of MLbase, a system designed to make it easier to (1) use machine learning and to (2) implement distributed ML algorithms. I will explain the declarative approach of ML as well as the high-level operators that will enable ML developers to scalably implement a wide range of ML methods without deep systems knowledge.

The frequency of small patterns, such 3 or 4-cliques or rectangles can reveal a lot of information about a graph. In social sciences, they are referred to as structural signatures, and in biology they are called graphlets. Subsequently, many algorithms have been proposed to efficiently count these patterns. We are developing sampling based approaches to count frequent patterns in graphs. Our methods come with provable error/confidence bounds. This talk will summarize the wedge sampling method for computing triadic measures, its generalization as a streaming algorithm, and our recent results on higher order structures.

### Tuesday, October 22nd, 2013

Although the largest data sets can only fit on large clusters of thousands of machines, there are many applications of "large data" that with appropriate techniques can fit on much more modest machines.

Graphs with more than 100 Billion edges (significantly larger than the twitter graph), for example, can fit in the primary memory on a modest rack mounted server and on secondary memory of a modest desktop machine. Furthermore multi or many-core shared memory machines can be much more efficient way of taking advantage of parallelism in terms of throughput per core or per watt than large distributed memory machines. I'll describe our experience mapping relatively large graph problems onto modest machines, including approaches for in-memory compression, for processing from secondary memory, and for using multicore parallelism within a single system.

Parallel matrix multiplication is one of the most studied fundamental problems in distributed and high performance computing. I'll present a new parallel algorithm that is based on Strassen’s fast matrix multiplication and minimizes communication. The algorithm outperforms all other parallel matrix multiplication algorithms, classical and Strassen-based, both asymptotically and in practice.

A critical bottleneck in parallelization of Strassen’s algorithm is the communication between the processors. We have proved lower bounds on these communication costs, using expansion properties of the underlying computation graph. Our algorithm matches these lower bounds, and so is communication-optimal.

I'll talk about the algorithm and its theoretical computation and communication costs, show performance results from several machines, and also discuss the practicality of using Strassen instead of classical parallel algorithms.

Motivated by problems in large-scale data analysis, randomized algorithms for matrix problems such as regression and low-rank matrix approximation have been the focus of a great deal of attention in recent years. These algorithms exploit novel random sampling and random projection methods; and implementations of these algorithms have already proven superior to traditional state-of-the-art algorithms, as implemented in Lapack and high-quality scientific computing software, for moderately-large problems stored in RAM on a single machine. Here, we describe the extension of these methods to computing high-precision solutions in parallel and distributed environments that are more common in very large-scale data analysis applications.

In particular, we consider both the Least Squares Approximation problem and the Least Absolute Deviation problem, and we develop and implement randomized algorithms that take advantage of modern computer architectures in order to achieve improved communication profiles. Our iterative least-squares solver, LSRN, is competitive with state-of-the-art implementations on moderately-large problems; and, when coupled with the Chebyshev semi-iterative method, scales well for solving large problems on clusters that have high communication costs such as on an Amazon Elastic Compute Cloud cluster. Our iterative least-absolute-deviations solver is based on fast ellipsoidal rounding, random sampling, and interior-point cutting-plane methods; and we demonstrate significant improvements over traditional algorithms on MapReduce. In addition, this algorithm can also be extended to solve more general convex problems on MapReduce.

The QR factorization is one of the most important and useful matrix factorizations in scientific computing. A recent communication-avoiding version of the tall-and-skinny QR factorization is ideal to compute the R matrix in a QR factorization on MapReduce, an environment where computationally intensive processes operate on small subsets of a large database. Getting the Q factor in a stable manner is more difficult. We present a few implementations of the tall-and-skinny QR (TSQR) factorization in the MapReduce framework and pay particular attention to the numerical stability of the procedure. As a byproduct, we can used the RSVD procedure to compute the SVD of tall-and-skinny matrices as well. We discuss some applications to simulation data analysis and reduced order modeling.

The mathematical connections between graph theory and linear algebra are intimate and well known. The computational links between the two fields are also deep, extending all the way to the design of basic data structures and fundamental algorithms that are efficient in time, memory, and power consumption. In the first 50 years of this computational relationship, graphs served numerical linear algebra by enabling efficient sparse matrix computation. Recently, matrix computation has been returning the favor, particularly in the domain of parallel and high-performance algorithms.

We consider a network of agents that are cooperatively solving a global optimization problem, where the objective function is the sum of privately known local objective functions of the agents and the decision variables are coupled via linear constraints. Recent literature focused on special cases of this formulation and studied Alternating Direction Method of Multipliers (ADMM) based methods for their solution, which require a synchronous implementation and a globally known order on the agents. In this paper, we present a novel asynchronous ADMM based distributed method for the general formulation and show that it converges at the rate $Oleft(1/k ight)$.

We initiate a study of the problem of PAC-learning from distributed data and analyze fundamental communication complexity questions involved. Broadly, we consider a framework where information is distributed between several locations, and our goal is to learn a low-error hypothesis with respect to the overall distribution of data using as little communication, and as few rounds of communication, as possible. As an example, suppose k research groups around the world have collected large scientific datasets, such as genomic sequence data or sky survey data, and we wish to perform learning over the union of all these different datasets without too much communication.

We provide general upper and lower bounds on the amount of communication needed to learn a given class, as well as broadly-applicable techniques for achieving communication-efficient learning. We also analyze a number of important specific classes, giving efficient learning algorithms with especially good communication performance.

We additionally present an analysis of privacy, considering both differential privacy and a notion of distributional privacy that is especially appealing in this context.

### Wednesday, October 23rd, 2013

In this talk I will describe a family of randomized parallel coordinate descent methods for minimizing a convex loss/objective function. The methods can be applied to smooth and composite (e.g., L1-regularized) functions [1, 2] as well as nonsmooth functions via a pre-processing smoothing step [3]. The family includes methods updating an (almost) arbitrary random subset of coordinates per iteration. Depending on the way the coordinates are selected and updated, the generic method specializes to, for instance, serial coordinate descent, partially parallel coordinate descent, distributed coordinate descent and gradient descent.

The critical observation is that parallelization of a coordinate descent method does not necessarily lead to acceleration. The amount of acceleration depends on the way coordinates are chosen at each iteration (which, thankfully, we can control) and, more importantly, on certain structural separability/sparsity properties of the loss function. I will hence describe several classes of 'good' functions which do admit parallelization speedup.

I will describe computational results involving billions of variables and terabyte data using ACDC: a fast implementation of the methods. Time permitting, I will also briefly comment on extensions to distributed and asynchronous settings and on the connection of the methods with stochastic gradient descent.

## References:

[1] Peter Richtárik and Martin Takáč, Parallel coordinate descent methods for big data optimization, arXiv:1212.0873, 2012

[2] Martin Takáč, Avleen Bijral, Peter Richtárik and Nati Srebro, Mini-batch primal and dual methods for SVMs, ICML 2013

[3] Olivier Fercoq and Peter Richtárik, Smooth minimization of nonsmooth functions with parallel coordinate descent methods, arXiv:1309.5885, 2013

[4] ACDC: http://code.google.com/p/ac-dc/

Kelner, Orecchia, Sidford, and Zhu recently introduced a novel almost-linear-time solver for Laplacian linear systems ("A Simple, Combinatorial Algorithm for Solving SDD Systems in Nearly-Linear Time", http://arxiv.org/abs/1301.6628). Their exposition of the algorithm is based on electrical flows. In the talk, I will present an algebraic view of the algorithm. This viewpoint clarifies the relationship between the new algorithm and older combinatorial Laplacian solvers and between the new algorithm and the Kaczmarz iterative solver. I will also explain the challenges that need to be addressed before the algorithm can be used on parallel machines to solve large-scale problems.

Asynchronous methods for solving systems of linear equations have been researched since Chazan and Miranker published their pioneering paper on chaotic relaxation in 1969. The underlying idea of asynchronous methods is to avoid processor idle time by allowing the processors to continue to work and make progress even if not all progress made by other processors has been communicated to them.

Historically, work on asynchronous methods for solving linear equations focused on proving convergence in the limit. How the rate of convergence compares to the rate of convergence of the synchronous counterparts, and how it scales when the number of processors increase, was seldom studied and is still not well understood. Furthermore, the applicability of these methods was limited to restricted classes of matrices (e.g., diagonally dominant matrices).

In this talk we discus s novel shared-memory asynchronous methods. We rigorously analyze the convergence rate and prove that it is linear and close to that of our method's synchronous counterpart as long as not too many processors are used (relative to the size and sparsity of the matrix). A key component is randomization, which allows the processors to make guaranteed progress without introducing synchronization. Our analysis shows a convergence rate that is linear in the condition number of the matrix, and depends on the number of processors and the degree to which the matrix is sparse.

I will present two examples of how parallelization is aided by approximating the process. The first story is about approximating a greedy process. Greedy set cover is a simple algorithm that gives the best known approximation under standard complexity assumptions. But from a parallelization point of view, it doesn't parallelize readily because each greedy step depends on the choices made in previous steps. We will look at how a simple approximation breaks this dependency without substantially harming the solution's quality.

The second story is about approximating triangle count using streaming algorithms techniques. The number of triangles in a graph is an important metric with a wide variety of applications. We will see how to design a streaming algorithm to approximate the count, and using ideas from the streaming algorithm, we will derive an efficient parallel algorithm which processes edges in the streaming graph in bulk. When the graph has sufficiently many triangles, the total work is nearly linear.

The maximum monomial agreement (MMA) problem involves finding a single logical conjunction that best separates a weighted set of positive and negative examples, represented as binary vectors. This problem is relevant for boosting methods for classifiers. In this talk we will discuss Goldberg and Eckstein's branch-and-bound algorithm for MMA, and it's implementation in the PEBBL (Parallel Enumeration and Branch-and-Bound Library) general-purpose branch-and-bound framework. We will mention some of PEBBL's unique features. We will show computational results for some examples from the Hungarian heart dataset and spam datasets on a distributed memory parallel computer. One of the most challenging examples scales almost perfectly to 8000 processors.

Co-authors: Jonathan Eckstein (Rutgers University), William Hart (Sandia National Laboratories)

We present a provably efficient, layer-by-layer, algorithm for training deep sum-product neural networks. Our algorithm comes with formal polynomial time convergence guarantees. Similar to other deep architectures, the resulting network can compactly represent any function on a finite training set. We also discuss provably efficient algorithms for training other types of neural networks.

We describe peeling algorithms, a useful greedy paradigm leading to fast algorithms for big data sets. With peeling algorithms, typically the problem can be represented as a (random) hypergraph, and vertices and edges are peeled away when the degree of a vertex is at most some fixed amount (usually 1). In particular, we discuss an analysis for parallel peeling algorithms, which demonstrates that parallel peeling can lead to very fast algorithms in that the peeling completes in O(log log n) rounds when the entire hypergraph can be peeled away.

### Thursday, October 24th, 2013

Linear and mixed programming seem like ideal candidates for massively parallel processing. Such problems have enormous appetites for computing resources that grow quickly as the amount of available data expands, the results of these computations have tremendous strategic and financial value, and the algorithms commonly used to solve such problems provide seemingly obvious sources of significant parallelism. Despite all this, large-scale parallel computing plays only a modest role in practical optimization. We'll look at some of the algorithmic challenges that stand in the way of more widespread adoption of parallel computing.

We study data-oblivious algorithms as a natural way to model privacy-preserving data outsourcing where a client, Alice, stores sensitive data at an honest-but-curious cloud storage server, Bob. The general approach involves obscuring the access patterns to a remote storage so that the manager of that storage, Bob, cannot infer information about its contents. Previous solutions typically involve costly amortized overheads for achieving this goal and involve potentially huge variations in access times, depending on when they occur. We show that efficient privacy-preserving data access is possible using a combination of probabilistic encryption, which directly hides data values, stateless oblivious RAM simulation, which hides the pattern of data accesses, and algorithms that are explicitly data oblivious.

Algorithms have two costs: arithmetic and communication, i.e. moving data between levels of a memory hierarchy or processors over a network. Communication costs (measured in time or energy per operation) already greatly exceed arithmetic costs, and the gap is growing over time following technological trends. Thus our goal is to design algorithms that minimize communication. We present algorithms that attain provable lower bounds on communication, and show large speedups compared to their conventional counterparts. These algorithms are for direct and iterative linear algebra, for dense and sparse matrices, as well as direct n-body simulations. Several of these algorithms exhibit perfect strong scaling, in both time and energy: run time (resp. energy) for a fixed problem size drops proportionally to the number of processors p (resp. is independent of p). Finally, we describe extensions to algorithms involving arbitrary loop nests and array accesses, assuming only that array subscripts are affine functions of the loop indices.

This talk considers the problems of distributed online prediction and optimization. Each node in a network of processors receives a stream of data in an online manner. Before the next data point (or mini-batch) arrives, the processor must make a prediction. Then, after receiving the next point, the processor accrues some loss. The goal of the processors is to minimize the total aggregate loss. We propose a consensus-based distributed optimization method for fitting a model used to make the predictions online. After observing each data point, nodes individually make gradient descent-like adjustments to their model parameters, and then consensus iterations are performed to synchronize models across the nodes. We prove that the proposed distributed method achieves the optimal centralized regret bound when the loss function has Lipschitz continuous gradients, and the amount of communication required depends on the network topology in the usual way, through the second smallest eigenvalue of the graph Laplacian. This is joint work with Konstantinos Tsianos.

Generalized matching problems arise in different research areas of computer science, including computational advertising, recommender systems, and trade markets. Consider, for example, the problem of recommending multimedia items (e.g., DVDs) to users such that (1) users are recommended items that they are likely to be interested in, (2) every user gets neither too few nor too many recommendations, and (3) only items available in stock are recommended to users. In practice, instances of this and related problems may involve millions of users and items; traditional matching algorithms fail at such scales. In this paper, we propose a distributed algorithm for approximately solving large-scale generalized matching problems like the one above. Our algorithm is based on linear programming and is designed to run on a small cluster of commodity nodes (it is also amenable to MapReduce). In more detail, we develop a distributed approximation algorithm for a broad class of problems, i.e., mixed packing-covering linear programs. Our algorithm has strong approximation guarantees and requires only a poly-logarithmic number of passes over the data. We also propose a distributed rounding algorithm that transforms the fractional solution of the linear program into a near-optimal integral solution. Experiments on real-world and synthetic data suggest that our algorithm scales to very large problem sizes and can outperform alternative approaches by multiple orders of magnitude.

It has been demonstrated recently that for machine learning problems, modern stochastic optimization techniques such as stochastic dual coordinate ascent have significantly better convergence behavior than traditional optimization methods. In this talk I will present a broad view of this class of methods including some new algorithmic developments. I will also discuss algorithms and practical considerations in their parallel implementations.

Recent advances in wired and wireless technology necessitate the development of theory, models and tools to cope with new challenges posed by large-scale networks and various problems arising in current and anticipated applications over such networks. In this talk, optimization problems and algorithms for distributed multi-agent networked systems will be discussed. The distributed nature of the problem is reflected in agents having their own local (private) information while they have a common goal to optimize the sum of their objectives through some limited information exchange. The inherent lack of a central coordinator is compensated through the use of network to communicate certain estimates and the use of appropriate local-aggregation schemes. The overall approach allows agents to achieve the desired optimization goal without sharing the explicit form of their locally known objective functions. However, the agents are willing to cooperate with each other locally to solve the problem by exchanging some estimates of relevant information. Distributed algorithms will be discussed for synchronous and asynchronous implementations together with their basic convergence properties.