Stencil Computations

The main objective of this activity is to optimize stencil computations for cluster platforms with commodity (e.g. Gbit Ethernet) or sophisticated (e.g. SCI, Myrinet) interconnects. In this context, our research has focused on applying the tiling (or supernode) loop transformation to stencils in order to minimize the effect of communication latency on the total parallel execution time of the algorithms. The tiling method groups neighboring computation points of the nested loop into blocks, called tiles or supernodes, thus increasing the computation grain and decreasing both the communication volume and frequency. Within the domain of stencil computations, nested loops and the tiling transformation, our group has tackled several problems. Efficient scheduling techniques for tiled stencil applications that enable communication/computation overlap have been investigated and presented, receiving one of the four best paper awards at IPDPS'01 (pdf). Automatic parallelization and efficient code generation methods have been proposed in TPDS (pdf) and Parallel Computing (pdf). Hybrid (MPI + OpenMP) parallel implementations of tiled stencil computations have been presented in IJCSE (pdf).
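The claim that larger tiles reduce both communication volume and frequency can be illustrated with a small back-of-the-envelope model. The sketch below (all sizes and the square tile shape are illustrative assumptions, not values from the project) computes the communication-to-computation ratio of an S x S tile of a 2-D stencil with unit dependences: each tile communicates one face of S boundary points but computes S*S interior points, so the ratio falls as 1/S.

```c
/* Minimal sketch: communication-to-computation ratio of a square
 * S x S tile of a 2-D stencil with unit dependences. The tile shape
 * and the single-face communication model are illustrative
 * assumptions for this sketch only. */
#include <assert.h>

static double comm_to_comp(int s) {
    double comm = (double)s;       /* boundary points sent per dependent face */
    double comp = (double)s * s;   /* points computed inside the tile */
    return comm / comp;            /* shrinks as 1/S when the tile grows */
}
```

Doubling the tile edge halves the ratio, which is why coarsening the computation grain pays off on latency-bound interconnects.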
Applying these tiling techniques, we have developed a tool which accepts C-like nested loops and partitions them into groups/tiles with small inter-communication requirements. The tool automatically generates efficient message-passing code (using MPI) to be executed on SMPs or clusters. Future work includes comparisons of certain variations of tiling (shape, size, etc.) and code generation techniques, based on experimental results from applying the tool to an SCI cluster.
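The core source-to-source step such a tool performs can be sketched on a serial example. Below, a plain 1-D Jacobi-style loop nest is rewritten into a tiled form that groups neighboring points into blocks of width T; the sizes N, STEPS and T are illustrative choices, not the tool's actual parameters, and the MPI generation step is omitted.

```c
/* Minimal sketch of the tiling transformation on a 1-D Jacobi-style
 * stencil. N, STEPS and T are illustrative; boundary points keep
 * their initial values. */
#include <assert.h>
#include <string.h>

#define N 64      /* problem size (illustrative) */
#define STEPS 8   /* time steps (illustrative) */
#define T 16      /* tile width: groups T neighboring points */

static void untiled(double *a, double *b) {
    for (int t = 0; t < STEPS; t++) {
        for (int i = 1; i < N - 1; i++)
            b[i] = (a[i - 1] + a[i] + a[i + 1]) / 3.0;
        memcpy(a, b, N * sizeof(double));
    }
}

static void tiled(double *a, double *b) {
    for (int t = 0; t < STEPS; t++) {
        /* outer loop over tiles, inner loop over points in each tile */
        for (int ii = 1; ii < N - 1; ii += T)
            for (int i = ii; i < ii + T && i < N - 1; i++)
                b[i] = (a[i - 1] + a[i] + a[i + 1]) / 3.0;
        memcpy(a, b, N * sizeof(double));
    }
}
```

Both versions visit the same points in the same order, so the transformation preserves the results while exposing tile boundaries at which per-tile MPI sends and receives could be inserted.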
Publications
This paper proposes a new method for the problem of minimizing the execution time of nested for-loops using a tiling transformation. In our approach, we are interested not only in tile size and shape, according to the required communication-to-computation ratio, but also in overall completion time. We select a time hyperplane to execute different tiles much more efficiently, by exploiting the inherent overlapping between communication and computation phases among successive, atomic tile executions. We assign tiles to processors according to the tile space boundaries, thus taking the iteration space bounds into account. Our schedule considerably reduces overall completion time under the assumption that some part of every communication phase can be efficiently overlapped with atomic, pure tile computations. The overall schedule resembles a pipelined datapath, where computations are no longer interleaved with sends and receives to non-local processors. Experimental results on a cluster of Pentiums, using various MPI send primitives, show that the total completion time is significantly reduced.
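The benefit the abstract describes can be captured by a simple cost model. In the sketch below (the parameters K, t_comp and t_comm are illustrative, not measured values from the paper), a non-overlapped schedule pays compute-then-communicate for each of K successive tiles, while a pipelined schedule hides each communication phase under the next tile's computation, leaving only the final communication exposed.

```c
/* Hedged sketch of the overlap argument: per-tile cost drops from
 * (t_comp + t_comm) to max(t_comp, t_comm) when communication is
 * pipelined behind the next computation. All parameters are
 * illustrative assumptions. */
#include <assert.h>

static double time_no_overlap(int k, double t_comp, double t_comm) {
    /* compute a tile, then wait for its sends/receives, k times */
    return k * (t_comp + t_comm);
}

static double time_overlap(int k, double t_comp, double t_comm) {
    /* each communication hides under the next computation; only the
       last communication phase remains exposed */
    double per_tile = t_comp > t_comm ? t_comp : t_comm;
    return k * per_tile + t_comm;
}
```

Whenever both phases are non-trivial, the overlapped completion time is strictly smaller, which is the effect the MPI experiments in the paper measure.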
