12th Annual International Conference on Parallel Architectures and Compilation Techniques
New Orleans, Louisiana, Sept.27-Oct.1, 2003
** Each tutorial is ½ Day **
[With questions contact organizers or Tutorial Chair Csaba Andras Moritz at email@example.com or 413-545 2442]
Saman P. Amarasinghe, Bill Thies; MIT
The streaming application class is becoming increasingly important and widespread. Encompassing programs for embedded signal processing, sensor networks, intelligent software routers, and high-performance graphics and image processing, the streaming domain offers a fresh set of challenges for architects, language designers, and compiler writers. Unlike programs in the scientific domain, stream programs are characterized by abundant parallelism, regular communication patterns, and short data lifetimes, all of which can be exploited to improve programmability and performance.
In this tutorial, we will describe recent advances in programming language and architectural support for stream programs. We will focus on lessons learned in the development of StreamIt, a language and compiler for high-performance streaming applications. Topics will include:
Saman P. Amarasinghe is an Associate Professor in the Department of Electrical Engineering and Computer Science at Massachusetts Institute of Technology and a member of the MIT Laboratory for Computer Science. Currently he leads the Commit compiler group and is the co-leader of the Raw project. Saman's research interests are in discovering novel approaches to improve the performance of modern computer systems without unduly increasing the complexity faced by either application developers, compiler writers, or computer architects. He received his BS in Electrical Engineering and Computer Science from Cornell University in 1988, and his MSEE and Ph.D from Stanford University in 1990 and 1997, respectively.
Bill Thies is a graduate student at the MIT Laboratory for Computer Science. Along with other members of Saman's group, he is working on the design and implementation of StreamIt.
Tarek El-Ghazawi, The George Washington University; William Carlson (firstname.lastname@example.org), IDA Center for Computing Sciences
20% Beginners, 50% Intermediate, 30% Advanced
UPC, or Unified Parallel C, is a parallel extension of ANSI C. UPC follows a distributed shared memory programming model aimed at leveraging the ease of programming of the shared memory paradigm, while enabling the exploitation of data locality. To this end, UPC incorporates constructs that allow placing data near the threads that manipulate them to minimize remote accesses. UPC also has many advanced synchronization features, including mechanisms for overlapping synchronization with local processing and constructs for defining memory consistency.
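The idea of placing data near the threads that use it can be illustrated outside UPC itself. The following is a minimal Python sketch, not UPC code: it models the round-robin layout that UPC applies to a shared array, where element i belongs to thread (i / block size) mod THREADS, with a default block size of 1. The function names `owner` and `local_indices` are hypothetical and chosen only for this illustration.

```python
def owner(index, n_threads, block_size=1):
    """Thread with affinity to element `index` of a shared array.

    Blocks of `block_size` consecutive elements are dealt out to the
    threads round-robin, which mirrors UPC's layout rule for shared arrays.
    """
    return (index // block_size) % n_threads


def local_indices(thread, n_elements, n_threads, block_size=1):
    """Indices a given thread can access without a remote reference."""
    return [i for i in range(n_elements)
            if owner(i, n_threads, block_size) == thread]


if __name__ == "__main__":
    # With 8 elements, 2 threads, and blocks of 2, each thread owns
    # alternating pairs of elements.
    for t in range(2):
        print(t, local_indices(t, n_elements=8, n_threads=2, block_size=2))
```

A loop that lets each thread touch only the indices returned for it performs no remote accesses, which is the locality benefit the paragraph above describes.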
UPC is the effort of a consortium of universities, government, and industry. It has been receiving rising attention from programmers and vendors and is now a product available on new Cray and HP parallel computers. Source compilers for other platforms, as well as TotalView debugger implementations, are also available.
Dr. El-Ghazawi received his Ph.D. degree in Electrical and Computer Engineering from New Mexico State University in 1988. He is currently an Associate Professor at the Department of Electrical and Computer Engineering of the George Washington University. Prior to GWU, he was an Associate Professor of Computational Sciences and Computer Engineering at the George Mason University and a Visiting Scientist at the Research Institute for Advanced Computer Science (RIACS) at NASA Ames Research Center. Tarek El-Ghazawi is one of the principal co-authors of the UPC and the UPC-IO specifications. He currently leads the UPC working group on Benchmarking and I/O in the UPC consortium. His research interests include high-performance computing and architectures, parallel I/O, and performance evaluations. Dr. El-Ghazawi has published extensively in these areas, and his research has been supported by NASA, DoD, NSF, and industry. He has served as an associate editor for the International Journal on Parallel and Distributed Systems and Networking, and as a guest editor for the IEEE Concurrency special track on High-Performance Data Mining. He has served in many roles on technical program committees, including the IEEE Aerospace Engineering, the Frontiers of the Massively Parallel Computations, and IEEE IPDPS. He is a Senior Member of the IEEE, a member of the ACM, a Fellow of the Arctic Region Supercomputing Center, and a member of Phi Kappa Phi.
Dr. William Carlson graduated from Worcester Polytechnic Institute in 1981 with a BS degree in Electrical Engineering. He then attended Purdue University, receiving the MSEE and Ph.D. degrees in Electrical Engineering in 1983 and 1988, respectively. From 1988 to 1990, Dr. Carlson was an Assistant Professor at the University of Wisconsin-Madison, where his work centered on performance evaluation of advanced computer architectures. Since 1990, he has been with the IDA Center for Computing Sciences, where his work focuses on operating systems, languages, and compilers for parallel and distributed computers. Accomplishments include the RES distributed computing system for harnessing very large numbers of workstations and AC, a distributed memory compiler for the Cray T3D and T3E multiprocessor systems. The latter demonstrated the ability to program the T3E as a shared-memory machine and led to the recent UPC effort. Currently, he leads the UPC effort to provide a shared-memory programming model across a wide range of platforms from distributed memory clusters to SMPs. This includes both the intellectual leadership of efforts to make the UPC language useful for many applications and the pragmatic coordination of a variety of UPC implementation efforts. He serves as a member of the NPACI External Visiting Committee, the PITAC OSS subcommittee, and several conference program committees. He is a member of the IEEE, ACM, and Eta Kappa Nu.
This tutorial is intended for university and industry computer architects who are interested in recent research developments in the area of Speculative Precomputation.
Speculative Precomputation, or pre-execution, is a new latency tolerance technique that uses spare hardware contexts in a multithreaded processor to accelerate the execution of a single-threaded executable. In contrast to traditional parallelism, it does this without offloading any of the computation of the original program. Speculative Precomputation runs code in the spare contexts that precomputes data, allowing the main program to eliminate performance-degrading events such as cache misses and branch mispredictions. It typically does this by borrowing code from the original program (using hand-coded, compiler-driven, or automatic techniques), allowing it to precompute or prefetch things that traditional pattern-based techniques for cache prefetching and branch prediction cannot. The increasing availability of multithreaded processors, coupled with the increasing importance of memory latencies and branch mispredictions to processor performance, makes these techniques relevant and important. Researchers from both academia and industry have recently proposed and evaluated various techniques in this area. This tutorial covers several topics related to such recent developments, including architecture support for executing speculative threads, hardware and compiler techniques for extracting effective precomputation code, and performance evaluation of Speculative Precomputation systems on both simulators and silicon. In addition to covering current techniques and performance, this tutorial will also discuss the impact of Speculative Precomputation on processor and compiler design developments in industry.
Donald Yeung received his Ph.D. in 1998 from the Massachusetts Institute of Technology, where he was a member of the MIT Alewife Project. Currently, Dr. Yeung is an Assistant Professor in the Electrical and Computer Engineering Department at the University of Maryland at College Park, and co-directs the University of Maryland's Systems and Computer Architecture Laboratory. His research interests lie in the areas of computer architecture, performance evaluation of computer systems, and the interaction of architectures, compilers, and applications. Dr. Yeung is a recipient of an NSF Faculty Early Career Development Award.
Dean Tullsen received his Ph.D. from the University of Washington in 1996, where he did his dissertation on simultaneous multithreading. He is an associate professor in the Computer Science and Engineering department at UCSD, where he co-directs the High-performance Processor Architecture and Compilation Lab. His research interests are in high-performance computer architecture, including multithreading architectures, memory and cache subsystems, and architecture-compiler interaction. He holds three patents in the area of multithreading architectures. Dr. Tullsen is a recipient of an NSF Faculty Early Career Development Award.
Steve Shih-wei Liao currently works in the Microprocessor Research Group at Intel Labs. His research interests are in advanced microarchitectures and compiler optimizations. He received his B.S. degree from National Taiwan University, and M.S. and Ph.D. degrees from Stanford University.
Implementation and Performance Issues
Mats Brorsson and Sven Karlsson, KTH
Description
OpenMP has become an important tool for bringing parallel computing to a larger community. However, first attempts to use OpenMP are sometimes discouraging in terms of performance, since users sometimes believe that all they need to do is insert some directives at suitable places, e.g. at for/do-loops with independent iterations. To understand the performance of an OpenMP program, it is important to understand how an OpenMP implementation can be built and how the synchronization and communication of a shared-memory program are actually carried out in real hardware.
In this tutorial, we describe the design and implementation of a complete OpenMP compilation system, consisting of a source-to-source OpenMP translator and a supporting run-time library. Furthermore, we discuss some performance-critical issues of OpenMP and how they relate to the implementation.
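To give a rough picture of what a translator and run-time library do with a parallel loop, here is a small Python sketch. It is an analogy under stated assumptions, not the system described in this tutorial: it assumes a static schedule and a sum reduction, and the names `static_chunks`, `outlined_body`, and `parallel_sum_of_squares` are hypothetical. The loop body is moved ("outlined") into a function, the iteration space is split into contiguous chunks, each thread runs the outlined function on its chunk, and the partial results are combined after all threads have joined.

```python
import threading


def static_chunks(n_iterations, n_threads):
    """Split range(n_iterations) into contiguous (start, end) chunks,
    one per thread, in the manner of a static schedule."""
    base, extra = divmod(n_iterations, n_threads)
    chunks, start = [], 0
    for tid in range(n_threads):
        size = base + (1 if tid < extra else 0)
        chunks.append((start, start + size))
        start += size
    return chunks


def parallel_sum_of_squares(data, n_threads=4):
    partial = [0] * n_threads  # one private slot per thread

    def outlined_body(tid, lo, hi):
        # The original loop body, outlined into a function that each
        # thread runs on its own chunk of iterations.
        local = 0
        for i in range(lo, hi):
            local += data[i] * data[i]
        partial[tid] = local

    threads = [
        threading.Thread(target=outlined_body, args=(tid, lo, hi))
        for tid, (lo, hi) in enumerate(static_chunks(len(data), n_threads))
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()  # plays the role of the barrier at the end of the loop
    return sum(partial)  # the reduction step


if __name__ == "__main__":
    print(parallel_sum_of_squares(list(range(10))))
```

The sketch shows structure only. In CPython the global interpreter lock prevents these threads from running the arithmetic in parallel, so no speedup should be expected from it; thread creation, the barrier, and the reduction are nonetheless the kinds of overhead that influence the performance of a real OpenMP program.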
The participants will learn:
The example OpenMP system that we study consists of:
The software described in this tutorial is distributed freely with source code (except for the Fortran translator, which is distributed only in binary form for Linux/Irix) and has been partly developed within the EU-funded project Intone under contract number IST-1999-20252.
Outline of tutorial
This tutorial is appropriate for software developers who want to understand the performance of their OpenMP programs, what the compiler does with an OpenMP program, and what happens in the run-time system. It is also appropriate for parallel computing researchers who want to experiment with new OpenMP constructs or run-time support. A basic knowledge of OpenMP is presumed; a very short introduction to OpenMP is nevertheless given as a refresher.
Content level: 20% beginner, 60% intermediate, 20% advanced.
Mats Brorsson is a professor in computer architecture at KTH, the Royal Institute of Technology in Stockholm, Sweden. Prof. Brorsson has been conducting research and education in the area of parallel computing and computer architecture for more than 15 years and has published more than 30 papers in these areas. He received his MSc in electrical engineering in 1985 and his PhD in computer systems in 1994, both from Lund University in Lund, Sweden.
Sven Karlsson is a Ph.D. student at KTH, the Royal Institute of Technology in Stockholm, Sweden. He received an MSc in Engineering Physics in 1997 from Lund University. Mr. Karlsson’s research interests are in system software for parallel computers, most notably software distributed shared memory systems, and in compilers for parallel computers. He has published 7 papers in these fields.
Understanding Your Results:
Statistical Tools for Computer Performance Measurement and Simulation
David J. Lilja, University of Minnesota
Computer architects and system designers have made tremendous advances in the performance of computer systems over the past several decades. However, measuring a computer system's performance can be problematic since performance is impacted by many different components in extremely complex and nonlinear ways. For example, it is well understood that simply increasing the clock rate will not necessarily produce a proportionate increase in the overall performance. These complex interactions introduce noise and uncertainty into our measurements of a system's performance, which makes it difficult to determine the impact any changes made to the system actually have on the overall performance. It also makes it difficult to directly compare the performance of different systems. This tutorial provides a gentle introduction to some of the key statistical tools and techniques needed to interpret noisy performance measurements and to understand complex simulation results. It also presents techniques that can be used to appropriately design experiments to obtain the maximum amount of information for a given level of experimental effort.
2. Detailed Description.
In this tutorial, the participants will learn how to --
- Rigorously compare the performance of computer systems in the presence of measurement noise.
- Determine whether a change made to a system has a statistically significant impact on performance.
- Use statistical tools to reduce the number of simulations that need to be performed of a computer system.
- Design a set of experiments to obtain the most information for a given level of effort.
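As an illustration of the first two goals above, the following Python sketch computes a 95% confidence interval for the mean difference between paired measurements and reports a change as significant when the interval excludes zero. This is one common approach and is not necessarily the one presented in the tutorial. The timing data are hypothetical, the function names are chosen for this example, and the critical values of Student's t are hard-coded for two sample sizes because the Python standard library does not provide the t distribution.

```python
import math
import statistics

# Two-sided 95% critical values of Student's t, keyed by degrees of freedom.
T_CRIT_95 = {4: 2.776, 9: 2.262}


def paired_confidence_interval(before, after, t_crit):
    """Confidence interval (low, high) for the mean of after[i] - before[i]."""
    diffs = [a - b for b, a in zip(before, after)]
    n = len(diffs)
    mean = statistics.mean(diffs)
    std_err = statistics.stdev(diffs) / math.sqrt(n)  # sample std. deviation
    margin = t_crit * std_err
    return mean - margin, mean + margin


def is_significant(interval):
    """A change is significant at this level if the interval excludes zero."""
    low, high = interval
    return not (low <= 0.0 <= high)


# Hypothetical execution times in seconds, measured on the same five
# benchmark runs before and after a change to the system.
baseline = [10.2, 9.8, 10.5, 10.1, 9.9]
modified = [9.6, 9.5, 9.9, 9.7, 9.4]

ci = paired_confidence_interval(baseline, modified,
                                T_CRIT_95[len(baseline) - 1])

if __name__ == "__main__":
    print("95% CI for mean difference:", ci)
    print("significant:", is_significant(ci))
```

If the interval includes zero, the measurements do not support the claim that the change affected performance, however different the two means may look.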
This tutorial is intended for computer architects, compiler writers, software designers, and application scientists and engineers who design or use high-performance computer systems. The level of the presentation is appropriate for both practitioners and graduate students. Experts from any scientific discipline will find this tutorial useful in helping to understand how to statistically analyze the performance of their systems and applications.
Content level: 30% beginner, 60% intermediate, 10% advanced.
This tutorial is appropriate for a very wide range of participants. It is assumed that the audience will have a basic understanding of probability and statistics at the level of an undergraduate in an engineering or scientific discipline. For example, they would be expected to understand what a mean value is, what the variance of a set of measurements is, what is meant by the number of degrees of freedom of a statistic, and so forth.
David J. Lilja received the Ph.D. and M.S. degrees, both in Electrical Engineering, from the University of Illinois at Urbana-Champaign, and a B.S. in Computer Engineering from Iowa State University in Ames. He is currently a Professor of Electrical and Computer Engineering, and a Fellow of the Minnesota Supercomputing Institute, at the University of Minnesota in Minneapolis. He also serves as a member of the graduate faculties in Computer Science and Scientific Computation, and was the founding Director of Graduate Studies for Computer Engineering. He has been a visiting senior engineer in the Hardware Performance Analysis group at IBM in Rochester, Minnesota, and a visiting professor at the University of Western Australia in Perth supported by a Fulbright award. Previously, he worked as a research assistant at the Center for Supercomputing Research and Development at the University of Illinois, and as a development engineer at Tandem Computers Incorporated (now a division of HP/Compaq) in Cupertino, California. He has served on the program committees of numerous conferences; was a distinguished visitor of the IEEE Computer Society; is a Senior member of the IEEE and a member of the ACM; and is a registered Professional Engineer. His primary research interests are in high-performance computer architecture, parallel computing, nanocomputing, hardware-software interactions, and performance analysis.