diff --git a/BINARY_README b/BINARY_README index ed431bd515..67c581a542 100644 --- a/BINARY_README +++ b/BINARY_README @@ -1,4 +1,45 @@ + PowerLyra Binary Release + ------------------------ + +======= +License +======= + +PowerLyra is free software licensed under the Apache 2.0 License. See +license/LICENSE.txt for details. + +============ +Introduction +============ + +PowerLyra is based on the latest codebase of GraphLab PowerGraph (a distributed graph +computation framework written in C++) and can seamlessly support all GraphLab toolkits. + +PowerLyra Features: + +Hybrid computation engine: Exploit the locality of low-degree vertices + and the parallelism of high-degree vertices + +Hybrid partitioning algorithm: Differentiate the partitioning algorithms + for different types of vertices + +Diverse scheduling strategy: Provide both synchronous and asynchronous + computation engines + +Compatible API: Seamlessly support all GraphLab toolkits + + +====================== +Installation and Usage +====================== + +The installation and tutorial of PowerLyra fully follow that of GraphLab. +See following notes for details. 
+ + + + Graphlab Binary Release ----------------------- diff --git a/CMakeLists.txt b/CMakeLists.txt index 8d2caafcee..f5d60b282d 100644 --- a/CMakeLists.txt +++ b/CMakeLists.txt @@ -236,9 +236,9 @@ if(NOT NO_TCMALLOC) # We use tcmalloc for improved memory allocation performance ExternalProject_Add(libtcmalloc PREFIX ${GraphLab_SOURCE_DIR}/deps/tcmalloc - # Some users can't access domain googlecode.com ,This is a spare URL - # URL http://sourceforge.jp/projects/sfnet_gperftools.mirror/downloads/gperftools-2.0.tar.gz - URL http://gperftools.googlecode.com/files/gperftools-2.0.tar.gz + # Some users can't access domain googlecode.com, so we replace it with a link to our project server + # URL http://gperftools.googlecode.com/files/gperftools-2.0.tar.gz + URL http://ipads.se.sjtu.edu.cn/projects/powerlyra/deps/gperftools-2.0.tar.gz URL_MD5 13f6e8961bc6a26749783137995786b6 PATCH_COMMAND patch -N -p0 -i ${GraphLab_SOURCE_DIR}/patches/tcmalloc.patch || true CONFIGURE_COMMAND /configure --enable-frame-pointers --prefix= ${tcmalloc_shared} diff --git a/README.md b/README.md index 5e02a54c01..7b98bbcc2a 100644 --- a/README.md +++ b/README.md @@ -1,269 +1,64 @@ -# GraphLab PowerGraph v2.2 - -## UPDATE: For a signficant evolution of this codebase, see GraphLab Create which is available for download at [dato.com](http://dato.com) - -## History -In 2013, the team that created GraphLab PowerGraph started the Seattle-based company, GraphLab, Inc. The learnings from GraphLab PowerGraph and GraphChi projects have culminated into GraphLab Create, a enterprise-class data science platform for data scientists and software engineers that can simplify building and deploying advanced machine learning models as a RESTful predictive service. In January 2015, GraphLab, Inc. was renamed to Dato, Inc. See [dato.com](http://dato.com) for more information. - -## Status -GraphLab PowerGraph is no longer in active development by the founding team. 
GraphLab PowerGraph is now supported by the community at [http://forum.dato.com/](http://forum.dato.com/). - -# Introduction - -GraphLab PowerGraph is a graph-based, high performance, distributed computation framework written in C++. - -The GraphLab PowerGraph academic project was started in 2009 at Carnegie Mellon University to develop a new parallel computation abstraction tailored to machine learning. GraphLab PowerGraph 1.0 employed shared-memory design. In GraphLab PowerGraph 2.1, the framework was redesigned to target the distributed environment. It addressed the difficulties with real-world power-law graphs and achieved unparalleled performance at the time. In GraphLab PowerGraph 2.2, the Warp System was introduced and provided a new flexible, distributed architecture around fine-grained user-mode threading (fibers). The Warp System allows one to easily extend the abstraction, to improve optimization for example, while also improving usability. - -GraphLab PowerGraph is the culmination of 4-years of research and development into graph computation, distributed computing, and machine learning. GraphLab PowerGraph scales to graphs with billions of vertices and edges easily, performing orders of magnitude faster than competing systems. GraphLab PowerGraph combines advances in machine learning algorithms, asynchronous distributed graph computation, prioritized scheduling, and graph placement with optimized low-level system design and efficient data-structures to achieve unmatched performance and scalability in challenging machine learning tasks. - -Related is GraphChi, a spin-off project separate from the GraphLab PowerGraph project. GraphChi was designed to run very large graph computations on just a single machine, by using a novel algorithm for processing the graph from disk (SSD or hard drive) enabling a single desktop computer (actually a Mac Mini) to tackle problems that previously demanded an entire cluster. 
For more information, see [https://github.com/GraphChi](https://github.com/GraphChi). - -# License - - -GraphLab PowerGraph is released under the [Apache 2 license](http://www.apache.org/licenses/LICENSE-2.0.html). - -If you use GraphLab PowerGraph in your research, please cite our paper: -``` - @inproceedings{Low+al:uai10graphlab, - title = {GraphLab: A New Parallel Framework for Machine Learning}, - author = {Yucheng Low and - Joseph Gonzalez and - Aapo Kyrola and - Danny Bickson and - Carlos Guestrin and - Joseph M. Hellerstein}, - booktitle = {Conference on Uncertainty in Artificial Intelligence (UAI)}, - month = {July}, - year = {2010} +# PowerLyra v1.0 +## License + +PowerLyra is released under the [Apache 2 license](http://www.apache.org/licenses/LICENSE-2.0.html). + +If you use PowerLyra in your research, please cite our paper: +``` + @inproceedings{Chen:eurosys2015powerlyra, + title = {PowerLyra: Differentiated Graph Computation and Partitioning on Skewed Graphs}, + author = {Chen, Rong and Shi, Jiaxin and Chen, Yanzhe and Chen, Haibo}, + booktitle = {Proceedings of the Tenth European Conference on Computer Systems}, + series = {EuroSys '15}, + year = {2015}, + location = {Bordeaux, France}, } ``` -# Academic and Conference Papers - -Joseph E. Gonzalez, Yucheng Low, Haijie Gu, Danny Bickson, and Carlos Guestrin (2012). "[PowerGraph: Distributed Graph-Parallel Computation on Natural Graphs](https://www.usenix.org/conference/osdi12/technical-sessions/presentation/gonzalez)." Proceedings of the 10th USENIX Symposium on Operating Systems Design and Implementation (OSDI '12). - -Yucheng Low, Joseph Gonzalez, Aapo Kyrola, Danny Bickson, Carlos Guestrin and Joseph M. Hellerstein (2012). "[Distributed GraphLab: A Framework for Machine Learning and Data Mining in the Cloud](http://vldb.org/pvldb/vol5/p716_yuchenglow_vldb2012.pdf)." Proceedings of the VLDB Endowment (PVLDB). -Yucheng Low, Joseph Gonzalez, Aapo Kyrola, Danny Bickson, Carlos Guestrin, and Joseph M. 
Hellerstein (2010). "[GraphLab: A New Parallel Framework for Machine Learning](http://arxiv.org/pdf/1006.4990v1.pdf)." Conference on Uncertainty in Artificial Intelligence (UAI). +## Introduction -Li, Kevin; Gibson, Charles; Ho, David; Zhou, Qi; Kim, Jason; Buhisi, Omar; Brown, Donald E.; Gerber, Matthew, "[Assessment of machine learning algorithms in cloud computing frameworks](http://ieeexplore.ieee.org/xpl/articleDetails.jsp?reload=true&arnumber=6549501)", Systems and Information Engineering Design Symposium (SIEDS), 2013 IEEE, pp.98,103, 26-26 April 2013 +PowerLyra is based on the latest codebase of GraphLab PowerGraph (a distributed graph computation framework written in C++) and can seamlessly support all GraphLab toolkits. PowerLyra provides several new hybrid execution engines and partitioning algorithms to achieve optimal performance by leveraging input graph properties (e.g., power-law and bipartite). -[Towards Benchmarking Graph-Processing Platforms](http://sc13.supercomputing.org/sites/default/files/PostersArchive/post152.html). by Yong Guo (Delft University of Technology), Marcin Biczak (Delft University of Technology), Ana Lucia Varbanescu (University of Amsterdam), Alexandru Iosup (Delft University of Technology), Claudio Martella (VU University Amsterdam), Theodore L. Willke (Intel Corporation), in Super Computing 13 +PowerLyra New Features: -Aapo Kyrola, Guy Blelloch, and Carlos Guestrin (2012). "[GraphChi: Large-Scale Graph computation on Just a PC](https://www.usenix.org/conference/osdi12/technical-sessions/presentation/kyrola)." Proceedings of the 10th USENIX Symposium on Operating Systems Design and Implementation (OSDI '12). 
+* **Hybrid computation engine:** Exploit the locality of low-degree vertices and the parallelism of high-degree vertices +* **Hybrid partitioning algorithm:** Differentiate the partitioning algorithms for different types of vertices -# The Software Stack +* **Diverse scheduling strategy:** Provide both synchronous and asynchronous computation engines -The GraphLab PowerGraph project consists of a core API and a collection of high-performance machine learning and data mining toolkits built on top. The API is written in C++ and built on top of standard cluster and cloud technologies. Inter-process communication is accomplished over TCP-IP and MPI is used to launch and manage GraphLab PowerGraph programs. Each process is multithreaded to fully utilize the multicore resources available on modern cluster nodes. It supports reading and writing to both Posix and HDFS filesystems. +* **Compatible API:** Seamlessly support all GraphLab toolkits -![GraphLab PowerGraph Software Stack](images/gl_os_software_stack.png "GraphLab Software Stack") +For more details on the PowerLyra see http://ipads.se.sjtu.edu.cn/projects/powerlyra.html, including new features, instructions, etc. -GraphLab PowerGraph has a large selection of machine learning methods already implemented (see /toolkits directory in this repo). You can also implement your own algorithms on top of the graph programming API (a certain degree of C++ knowledge is required). -GraphLab PowerGraph Feature Highlights --------------------------------------- +### Hybrid Computation Engine -* **Unified multicore/distributed API:** write once run anywhere +We argue that skewed distribution in natural graphs also calls for differentiated processing of high-degree and low-degree vertices. 
PowerLyra uses Pregel/GraphLab-like computation models for processing low-degree vertices to minimize computation, communication and synchronization overhead, and uses PowerGraph-like computation model for processing high-degree vertices to reduce load imbalance, contention and memory pressure. PowerLyra follows the interface of GAS (Gather, Apply and Scatter) model and can seamlessly support various graph algorithms (e.g., all GraphLab toolkits). -* **Tuned for performance:** optimized C++ execution engine leverages extensive multi-threading and asynchronous IO - -* **Scalable:** Run on large cluster deployments by intelligently placing data and computation - -* **HDFS Integration:** Access your data directly from HDFS - -* **Powerful Machine Learning Toolkits:** Tackle challenging machine learning problems with ease - -## Building +![Hybrid Computation Engine](images/hybrid_engine.png "Hybrid Computation Engine") -The current version of GraphLab PowerGraph was tested on Ubuntu Linux 64-bit 10.04, 11.04 (Natty), 12.04 (Pangolin) as well as Mac OS X 10.7 (Lion) and Mac OS X 10.8 (Mountain Lion). It requires a 64-bit operating system. -# Dependencies +### Hybrid Graph Partitioning -To simplify installation, GraphLab PowerGraph currently downloads and builds most of its required dependencies using CMake’s External Project feature. This also means the first build could take a long time. +PowerLyra additionally proposes a new hybrid graph cut algorithm that embraces the best of both worlds in edge-cut and vertex-cut, which evenly distributes low-degree vertices along with their edges like edge-cut, and evenly distributes edges of high-degree vertices like vertex-cut. Both theoretical analysis and empirical validation show that the expected replication factor of random hybrid-cut is always better than random (Hash-based), constrained (e.g., Grid), and heuristic (e.g., Oblivious or Coordinated) vertex-cut for skewed power-law graphs. 
-There are however, a few dependencies which must be manually satisfied. +![Hybrid Partitioning Algorithms](images/hybrid_cut.png "Hybrid Graph Partitioning") -* On OS X: g++ (>= 4.2) or clang (>= 3.0) [Required] - + Required for compiling GraphLab. -* On Linux: g++ (>= 4.3) or clang (>= 3.0) [Required] - + Required for compiling GraphLab. +## Academic and Conference Papers -* *nix build tools: patch, make [Required] - + Should come with most Mac/Linux systems by default. Recent Ubuntu version will require to install the build-essential package. +Rong Chen, Jiaxin Shi, Yanzhe Chen and Haibo Chen. "[PowerLyra: Differentiated Graph Computation and Partitioning on Skewed Graphs](http://ipads.se.sjtu.edu.cn/projects/powerlyra/powerlyra-eurosys-final.pdf)." Proceedings of the 10th ACM SIGOPS European Conference on Computer Systems (EuroSys). Bordeaux, France. April, 2015. -* zlib [Required] - + Comes with most Mac/Linux systems by default. Recent Ubuntu version will require the zlib1g-dev package. +Rong Chen, Jiaxin Shi, Binyu Zang and Haibing Guan. "[BiGraph: Bipartite-oriented Distributed Graph Partitioning for Big Learning](http://ipads.se.sjtu.edu.cn/projects/powerlyra/bigraph-apsys14.pdf)." Proceedings of the 5th Asia-Pacific Workshop on Systems (APSys). Beijing, China. June, 2014. -* Open MPI or MPICH2 [Strongly Recommended] - + Required for running GraphLab distributed. +Rong Chen, Jiaxin Shi, Haibo Chen and Binyu Zang. "[Bipartite-oriented Distributed Graph Partitioning for Big Learning](http://ipads.se.sjtu.edu.cn/projects/powerlyra/bigraph-jcst.pdf)." Journal of Computer Science and Technology (JCST), 30(1), pp. 20-29. January, 2015. -* JDK 6 or greater [Optional] - + Required for HDFS support - -## Satisfying Dependencies on Mac OS X - -Installing XCode with the command line tools (in XCode 4.3 you have to do this manually in the XCode Preferences -> Download pane), satisfies all of these dependencies. 
- -## Satisfying Dependencies on Ubuntu - -All the dependencies can be satisfied from the repository: - - sudo apt-get update - sudo apt-get install gcc g++ build-essential libopenmpi-dev openmpi-bin default-jdk cmake zlib1g-dev git - -# Downloading GraphLab PowerGraph - -You can download GraphLab PowerGraph directly from the Github Repository. Github also offers a zip download of the repository if you do not have git. - -The git command line for cloning the repository is: - - git clone https://github.com/graphlab-code/graphlab.git - cd graphlab - - -# Compiling and Running - -``` -./configure -``` - -In the graphlabapi directory, will create two sub-directories, release/ and debug/ . cd into either of these directories and running make will build the release or the debug versions respectively. Note that this will compile all of GraphLab, including all toolkits. Since some toolkits require additional dependencies (for instance, the Computer Vision toolkit needs OpenCV), this will also download and build all optional dependencies. - -We recommend using make’s parallel build feature to accelerate the compilation process. For instance: - -``` -make -j4 -``` -will perform up to 4 build tasks in parallel. When building in release/ mode, GraphLab does require a large amount of memory to compile with the heaviest toolkit requiring 1GB of RAM. - -Alternatively, if you know exactly which toolkit you want to build, cd into the toolkit’s sub-directory and running make, will be significantly faster as it will only download the minimal set of dependencies for that toolkit. For instance: - -``` -cd release/toolkits/graph_analytics -make -j4 -``` - -will build only the Graph Analytics toolkit and will not need to obtain OpenCV, Eigen, etc used by the other toolkits. - -## Compilation Issues -If you encounter issues please post the following on the [GraphLab forum](http://forum.graphlab.com). 
- -* detailed description of the problem you are facing -* OS and OS version -* output of uname -a -* hardware of the machine -* utput of g++ -v and clang++ -v -* contents of graphlab/config.log and graphlab/configure.deps - -# Writing Your Own Apps - -There are two ways to write your own apps. - -* To work in the GraphLab PowerGraph source tree, (recommended) -* Install and link against Graphlab PowerGraph (not recommended) - - -## 1: Working in the GraphLab PowerGraph Source Tree - -This is the best option if you just want to try using GraphLab PowerGraph quickly. GraphLab PowerGraph -uses the CMake build system which enables you to quickly create -a C++ project without having to write complicated Makefiles. - -1. Create your own sub-directory in the apps/ directory. for example apps/my_app - -2. Create a CMakeLists.txt in apps/my_app containing the following lines: - - project(GraphLab) - add_graphlab_executable(my_app [List of cpp files space separated]) - -3. Substituting the right values into the square brackets. For instance: - - project(GraphLab) - add_graphlab_executable(my_app my_app.cpp) - -4. Running "make" in the apps/ directory of any of the build directories -should compile your app. If your app does not show up, try running - - cd [the GraphLab API directory] - touch apps/CMakeLists.txt - - -## 2: Installing and Linking Against GraphLab PowerGraph - -To install and use GraphLab PowerGraph this way will require your system -to completely satisfy all remaining dependencies, which GraphLab PowerGraph normally -builds automatically. This path is not extensively tested and is -**not recommended** - -You will require the following additional dependencies - - libevent (>=2.0.18) - - libjson (>=7.6.0) - - libboost (>=1.53) - - libhdfs (required for HDFS support) - - tcmalloc (optional) - -Follow the instructions in the [Compiling] section to build the release/ -version of the library. Then cd into the release/ build directory and -run make install . 
This will install the following: - -* include/graphlab.hpp - + The primary GraphLab header -* include/graphlab/... - + The folder containing the headers for the rest of the GraphLab library -* lib/libgraphlab.a - + The GraphLab static library. - -Once you have installed GraphLab PowerGraph you can compile your program by running: - -``` -g++ -O3 -pthread -lzookeeper_mt -lzookeeper_st -lboost_context -lz -ltcmalloc -levent -levent_pthreads -ljson -lboost_filesystem -lboost_program_options -lboost_system -lboost_iostreams -lboost_date_time -lhdfs -lgraphlab hello_world.cpp -``` - -If you have compiled with MPI support, you will also need - - -lmpi -lmpi++ - -# Tutorials -See [tutorials](TUTORIALS.md) - -# Datasets -The following are data sets links we found useful when getting started with GraphLab PowerGraph. - -##Social Graphs -* [Stanford Large Network Dataset (SNAP)](http://snap.stanford.edu/data/index.html) -* [Laboratory for Web Algorithms](http://law.di.unimi.it/datasets.php) - -##Collaborative Filtering -* [Million Song dataset](http://labrosa.ee.columbia.edu/millionsong/) -* [Movielens dataset GroupLens](http://grouplens.org/datasets/movielens/) -* [KDD Cup 2012 by Tencent, Inc.](https://www.kddcup2012.org/) -* [University of Florida sparse matrix collection](http://www.cise.ufl.edu/research/sparse/matrices/) - -##Classification -* [Airline on time performance](http://stat-computing.org/dataexpo/2009/) -* [SF restaurants](http://missionlocal.org/san-francisco-restaurant-health-inspections/) - -##Misc -* [Amazon Web Services public datasets](http://aws.amazon.com/datasets) - -# Release Notes -##### **map_reduce_vertices/edges and transform_vertices/edges are not parallelized on Mac OS X** - -These operations currently rely on OpenMP for parallelism. - -On OS X 10.6 and earlier, gcc 4.2 has several OpenMP bugs and is not stable enough to use reliably. - -On OS X 10.7, the clang -++ compiler does not yet support OpenMP. 
+## Building -##### **map_reduce_vertices/edges and transform_vertices/edges use a lot more processors than what was specified in –ncpus** +The building, installation and tutorial of PowerLyra fully follow those of GraphLab PowerGraph. See README_graphlab.md for details. -This is related to the question above. While there is a simple temporary solution (omp_set_num_threads), we intend to properly resolve the issue by not using openMP at all. -##### **Unable to launch distributed GraphLab when each machine has multiple network interfaces** -The communication initialization currently takes the first non-localhost IP address as the machine’s IP. A more reliable solution will be to use the hostname used by MPI. diff --git a/README_graphlab.md b/README_graphlab.md new file mode 100644 index 0000000000..529ec7b15a --- /dev/null +++ b/README_graphlab.md @@ -0,0 +1,269 @@ +# GraphLab PowerGraph v2.2 + +## UPDATE: For a significant evolution of this codebase, see GraphLab Create which is available for download at [dato.com](http://dato.com) + +## History +In 2013, the team that created GraphLab PowerGraph started the Seattle-based company, GraphLab, Inc. The learnings from GraphLab PowerGraph and GraphChi projects have culminated into GraphLab Create, an enterprise-class data science platform for data scientists and software engineers that can simplify building and deploying advanced machine learning models as a RESTful predictive service. In January 2015, GraphLab, Inc. was renamed to Dato, Inc. See [dato.com](http://dato.com) for more information. + +## Status +GraphLab PowerGraph is no longer in active development by the founding team. GraphLab PowerGraph is now supported by the community at [http://forum.dato.com/](http://forum.dato.com/). + +# Introduction + +GraphLab PowerGraph is a graph-based, high performance, distributed computation framework written in C++. 
+ +The GraphLab PowerGraph academic project was started in 2009 at Carnegie Mellon University to develop a new parallel computation abstraction tailored to machine learning. GraphLab PowerGraph 1.0 employed shared-memory design. In GraphLab PowerGraph 2.1, the framework was redesigned to target the distributed environment. It addressed the difficulties with real-world power-law graphs and achieved unparalleled performance at the time. In GraphLab PowerGraph 2.2, the Warp System was introduced and provided a new flexible, distributed architecture around fine-grained user-mode threading (fibers). The Warp System allows one to easily extend the abstraction, to improve optimization for example, while also improving usability. + +GraphLab PowerGraph is the culmination of 4-years of research and development into graph computation, distributed computing, and machine learning. GraphLab PowerGraph scales to graphs with billions of vertices and edges easily, performing orders of magnitude faster than competing systems. GraphLab PowerGraph combines advances in machine learning algorithms, asynchronous distributed graph computation, prioritized scheduling, and graph placement with optimized low-level system design and efficient data-structures to achieve unmatched performance and scalability in challenging machine learning tasks. + +Related is GraphChi, a spin-off project separate from the GraphLab PowerGraph project. GraphChi was designed to run very large graph computations on just a single machine, by using a novel algorithm for processing the graph from disk (SSD or hard drive) enabling a single desktop computer (actually a Mac Mini) to tackle problems that previously demanded an entire cluster. For more information, see [https://github.com/GraphChi](https://github.com/GraphChi). + +# License + + +GraphLab PowerGraph is released under the [Apache 2 license](http://www.apache.org/licenses/LICENSE-2.0.html). 
+ +If you use GraphLab PowerGraph in your research, please cite our paper: +``` + @inproceedings{Low+al:uai10graphlab, + title = {GraphLab: A New Parallel Framework for Machine Learning}, + author = {Yucheng Low and + Joseph Gonzalez and + Aapo Kyrola and + Danny Bickson and + Carlos Guestrin and + Joseph M. Hellerstein}, + booktitle = {Conference on Uncertainty in Artificial Intelligence (UAI)}, + month = {July}, + year = {2010} + } +``` + +# Academic and Conference Papers + +Joseph E. Gonzalez, Yucheng Low, Haijie Gu, Danny Bickson, and Carlos Guestrin (2012). "[PowerGraph: Distributed Graph-Parallel Computation on Natural Graphs](https://www.usenix.org/conference/osdi12/technical-sessions/presentation/gonzalez)." Proceedings of the 10th USENIX Symposium on Operating Systems Design and Implementation (OSDI '12). + +Yucheng Low, Joseph Gonzalez, Aapo Kyrola, Danny Bickson, Carlos Guestrin and Joseph M. Hellerstein (2012). "[Distributed GraphLab: A Framework for Machine Learning and Data Mining in the Cloud](http://vldb.org/pvldb/vol5/p716_yuchenglow_vldb2012.pdf)." Proceedings of the VLDB Endowment (PVLDB). + +Yucheng Low, Joseph Gonzalez, Aapo Kyrola, Danny Bickson, Carlos Guestrin, and Joseph M. Hellerstein (2010). "[GraphLab: A New Parallel Framework for Machine Learning](http://arxiv.org/pdf/1006.4990v1.pdf)." Conference on Uncertainty in Artificial Intelligence (UAI). + +Li, Kevin; Gibson, Charles; Ho, David; Zhou, Qi; Kim, Jason; Buhisi, Omar; Brown, Donald E.; Gerber, Matthew, "[Assessment of machine learning algorithms in cloud computing frameworks](http://ieeexplore.ieee.org/xpl/articleDetails.jsp?reload=true&arnumber=6549501)", Systems and Information Engineering Design Symposium (SIEDS), 2013 IEEE, pp.98,103, 26-26 April 2013 + +[Towards Benchmarking Graph-Processing Platforms](http://sc13.supercomputing.org/sites/default/files/PostersArchive/post152.html). 
by Yong Guo (Delft University of Technology), Marcin Biczak (Delft University of Technology), Ana Lucia Varbanescu (University of Amsterdam), Alexandru Iosup (Delft University of Technology), Claudio Martella (VU University Amsterdam), Theodore L. Willke (Intel Corporation), in Super Computing 13 + +Aapo Kyrola, Guy Blelloch, and Carlos Guestrin (2012). "[GraphChi: Large-Scale Graph computation on Just a PC](https://www.usenix.org/conference/osdi12/technical-sessions/presentation/kyrola)." Proceedings of the 10th USENIX Symposium on Operating Systems Design and Implementation (OSDI '12). + + +# The Software Stack + +The GraphLab PowerGraph project consists of a core API and a collection of high-performance machine learning and data mining toolkits built on top. The API is written in C++ and built on top of standard cluster and cloud technologies. Inter-process communication is accomplished over TCP-IP and MPI is used to launch and manage GraphLab PowerGraph programs. Each process is multithreaded to fully utilize the multicore resources available on modern cluster nodes. It supports reading and writing to both Posix and HDFS filesystems. + +![GraphLab PowerGraph Software Stack](images/gl_os_software_stack.png "GraphLab Software Stack") + +GraphLab PowerGraph has a large selection of machine learning methods already implemented (see /toolkits directory in this repo). You can also implement your own algorithms on top of the graph programming API (a certain degree of C++ knowledge is required). 
+ +GraphLab PowerGraph Feature Highlights +-------------------------------------- + +* **Unified multicore/distributed API:** write once run anywhere + +* **Tuned for performance:** optimized C++ execution engine leverages extensive multi-threading and asynchronous IO + +* **Scalable:** Run on large cluster deployments by intelligently placing data and computation + +* **HDFS Integration:** Access your data directly from HDFS + +* **Powerful Machine Learning Toolkits:** Tackle challenging machine learning problems with ease + +## Building + +The current version of GraphLab PowerGraph was tested on Ubuntu Linux 64-bit 10.04, 11.04 (Natty), 12.04 (Pangolin) as well as Mac OS X 10.7 (Lion) and Mac OS X 10.8 (Mountain Lion). It requires a 64-bit operating system. + +# Dependencies + +To simplify installation, GraphLab PowerGraph currently downloads and builds most of its required dependencies using CMake’s External Project feature. This also means the first build could take a long time. + +There are however, a few dependencies which must be manually satisfied. + +* On OS X: g++ (>= 4.2) or clang (>= 3.0) [Required] + + Required for compiling GraphLab. + +* On Linux: g++ (>= 4.3) or clang (>= 3.0) [Required] + + Required for compiling GraphLab. + +* *nix build tools: patch, make [Required] + + Should come with most Mac/Linux systems by default. Recent Ubuntu version will require to install the build-essential package. + +* zlib [Required] + + Comes with most Mac/Linux systems by default. Recent Ubuntu version will require the zlib1g-dev package. + +* Open MPI or MPICH2 [Strongly Recommended] + + Required for running GraphLab distributed. + +* JDK 6 or greater [Optional] + + Required for HDFS support + +## Satisfying Dependencies on Mac OS X + +Installing XCode with the command line tools (in XCode 4.3 you have to do this manually in the XCode Preferences -> Download pane), satisfies all of these dependencies. 
+ +## Satisfying Dependencies on Ubuntu + +All the dependencies can be satisfied from the repository: + + sudo apt-get update + sudo apt-get install gcc g++ build-essential libopenmpi-dev openmpi-bin default-jdk cmake zlib1g-dev git + +# Downloading GraphLab PowerGraph + +You can download GraphLab PowerGraph directly from the Github Repository. Github also offers a zip download of the repository if you do not have git. + +The git command line for cloning the repository is: + + git clone https://github.com/graphlab-code/graphlab.git + cd graphlab + + +# Compiling and Running + +``` +./configure +``` + +In the graphlabapi directory, will create two sub-directories, release/ and debug/ . cd into either of these directories and running make will build the release or the debug versions respectively. Note that this will compile all of GraphLab, including all toolkits. Since some toolkits require additional dependencies (for instance, the Computer Vision toolkit needs OpenCV), this will also download and build all optional dependencies. + +We recommend using make’s parallel build feature to accelerate the compilation process. For instance: + +``` +make -j4 +``` + +will perform up to 4 build tasks in parallel. When building in release/ mode, GraphLab does require a large amount of memory to compile with the heaviest toolkit requiring 1GB of RAM. + +Alternatively, if you know exactly which toolkit you want to build, cd into the toolkit’s sub-directory and running make, will be significantly faster as it will only download the minimal set of dependencies for that toolkit. For instance: + +``` +cd release/toolkits/graph_analytics +make -j4 +``` + +will build only the Graph Analytics toolkit and will not need to obtain OpenCV, Eigen, etc used by the other toolkits. + +## Compilation Issues +If you encounter issues please post the following on the [GraphLab forum](http://forum.graphlab.com). 
+ +* detailed description of the problem you are facing +* OS and OS version +* output of uname -a +* hardware of the machine +* utput of g++ -v and clang++ -v +* contents of graphlab/config.log and graphlab/configure.deps + +# Writing Your Own Apps + +There are two ways to write your own apps. + +* To work in the GraphLab PowerGraph source tree, (recommended) +* Install and link against Graphlab PowerGraph (not recommended) + + +## 1: Working in the GraphLab PowerGraph Source Tree + +This is the best option if you just want to try using GraphLab PowerGraph quickly. GraphLab PowerGraph +uses the CMake build system which enables you to quickly create +a C++ project without having to write complicated Makefiles. + +1. Create your own sub-directory in the apps/ directory. for example apps/my_app + +2. Create a CMakeLists.txt in apps/my_app containing the following lines: + + project(GraphLab) + add_graphlab_executable(my_app [List of cpp files space separated]) + +3. Substituting the right values into the square brackets. For instance: + + project(GraphLab) + add_graphlab_executable(my_app my_app.cpp) + +4. Running "make" in the apps/ directory of any of the build directories +should compile your app. If your app does not show up, try running + + cd [the GraphLab API directory] + touch apps/CMakeLists.txt + + +## 2: Installing and Linking Against GraphLab PowerGraph + +To install and use GraphLab PowerGraph this way will require your system +to completely satisfy all remaining dependencies, which GraphLab PowerGraph normally +builds automatically. This path is not extensively tested and is +**not recommended** + +You will require the following additional dependencies + - libevent (>=2.0.18) + - libjson (>=7.6.0) + - libboost (>=1.53) + - libhdfs (required for HDFS support) + - tcmalloc (optional) + +Follow the instructions in the [Compiling] section to build the release/ +version of the library. Then cd into the release/ build directory and +run make install . 
This will install the following: + +* include/graphlab.hpp + + The primary GraphLab header +* include/graphlab/... + + The folder containing the headers for the rest of the GraphLab library +* lib/libgraphlab.a + + The GraphLab static library. + +Once you have installed GraphLab PowerGraph you can compile your program by running: + +``` +g++ -O3 -pthread -lzookeeper_mt -lzookeeper_st -lboost_context -lz -ltcmalloc -levent -levent_pthreads -ljson -lboost_filesystem -lboost_program_options -lboost_system -lboost_iostreams -lboost_date_time -lhdfs -lgraphlab hello_world.cpp +``` + +If you have compiled with MPI support, you will also need + + -lmpi -lmpi++ + +# Tutorials +See [tutorials](TUTORIALS.md) + +# Datasets +The following are data sets links we found useful when getting started with GraphLab PowerGraph. + +##Social Graphs +* [Stanford Large Network Dataset (SNAP)](http://snap.stanford.edu/data/index.html) +* [Laboratory for Web Algorithms](http://law.di.unimi.it/datasets.php) + +##Collaborative Filtering +* [Million Song dataset](http://labrosa.ee.columbia.edu/millionsong/) +* [Movielens dataset GroupLens](http://grouplens.org/datasets/movielens/) +* [KDD Cup 2012 by Tencent, Inc.](https://www.kddcup2012.org/) +* [University of Florida sparse matrix collection](http://www.cise.ufl.edu/research/sparse/matrices/) + +##Classification +* [Airline on time performance](http://stat-computing.org/dataexpo/2009/) +* [SF restaurants](http://missionlocal.org/san-francisco-restaurant-health-inspections/) + +##Misc +* [Amazon Web Services public datasets](http://aws.amazon.com/datasets) + +# Release Notes +##### **map_reduce_vertices/edges and transform_vertices/edges are not parallelized on Mac OS X** + +These operations currently rely on OpenMP for parallelism. + +On OS X 10.6 and earlier, gcc 4.2 has several OpenMP bugs and is not stable enough to use reliably. + +On OS X 10.7, the clang +++ compiler does not yet support OpenMP. 
+ +##### **map_reduce_vertices/edges and transform_vertices/edges use a lot more processors than what was specified in –ncpus** + +This is related to the question above. While there is a simple temporary solution (omp_set_num_threads), we intend to properly resolve the issue by not using openMP at all. + +##### **Unable to launch distributed GraphLab when each machine has multiple network interfaces** + +The communication initialization currently takes the first non-localhost IP address as the machine’s IP. A more reliable solution will be to use the hostname used by MPI. \ No newline at end of file diff --git a/TUTORIALS.md b/TUTORIALS.md index 9e1077093e..ad476d93a6 100644 --- a/TUTORIALS.md +++ b/TUTORIALS.md @@ -1,6 +1,6 @@ # GraphLab PowerGraph Tutorials -##Table of Contents +## Table of Contents * [Deploying on AWS EC2 Cluster](#ec2) * [Deploying in a Cluster](#cluster) * [Deploying on a single multicore machine](#multicore) @@ -142,7 +142,7 @@ wget http://www.select.cs.cmu.edu/code/graphlab/datasets/smallnetflix_mm.validat ``` Now run GraphLab: -```` +``` mpiexec -n 2 -hostfile ~/machines /path/to/als --matrix /some/ns/folder/smallnetflix/ --max_iter=3 --ncpus=1 --minval=1 --maxval=5 --predictions=out_file ``` Where -n is the number of MPI nodes, and –ncpus is the number of deployed cores on each MPI node. @@ -240,7 +240,7 @@ or: Check that all machines have access to, or are using the same binary -#Deployment on a single multicore machine +# Deployment on a single multicore machine ## Preliminaries: @@ -352,7 +352,7 @@ Here is a more detailed explanation of the benchmarking process. The benchmarkin 5. In case you would like to benchmark a different algorithm, you can add an additional youralgo_demo section into the gl_ec2.py script. 6. 
In case you would like to bechmark a regular instance, simply change the following line in gl_ec2.py from -```` +``` ./gl-ec2 -i ~/.ssh/amazonec2.pem -k amazonec2 -a hpc -s $MAX_SLAVES -t cc2.8xlarge launch hpctest ``` to: @@ -436,7 +436,7 @@ Previous to the program execution, the graph is first loaded into memory and par or -```` +``` --graph_opts="ingress=grid" # works for power of 2 sized cluster i.e. 2,4,8,.. machines ``` diff --git a/apps/example/CMakeLists.txt b/apps/example/CMakeLists.txt index 5a13bda9b6..481c208165 100644 --- a/apps/example/CMakeLists.txt +++ b/apps/example/CMakeLists.txt @@ -1,2 +1,5 @@ project(example) add_graphlab_executable(hello_world hello_world.cpp) + +project(word_search) +add_graphlab_executable(word_search word_search.cpp) diff --git a/apps/example/word_search.cpp b/apps/example/word_search.cpp new file mode 100644 index 0000000000..beda2d7380 --- /dev/null +++ b/apps/example/word_search.cpp @@ -0,0 +1,301 @@ +/* + * Copyright (c) 2013 Shanghai Jiao Tong University. + * All rights reserved. + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an "AS + * IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either + * express or implied. See the License for the specific language + * governing permissions and limitations under the License. 
+ * + * For more about this software visit: + * + * http://ipads.se.sjtu.edu.cn/projects/powerlyra.html + * + * + * 2014.02 implement word search application for testing bipartite-aware partitiong + * with affinity + * + */ + + + +#include +#include +#include + +#include + +bool USE_DELTA_CACHE = false; + +struct vertex_t { + bool is_doc; + int count; + std::vector str_vec; // only for doc + + vertex_t(std::vector& vec) : count(0) + { str_vec = vec; is_doc = true; } + + vertex_t() : count(0) + { is_doc = false; } + + void save(graphlab::oarchive& arc) const { + arc << is_doc << count << str_vec ; + } + + /** \brief Load the vertex data from a binary archive */ + void load(graphlab::iarchive& arc) { + arc >> is_doc >> count >> str_vec ; + } + +}; // end of edge data + +typedef vertex_t vertex_data_type; + +// There is no edge data in the pagerank application +typedef graphlab::empty edge_data_type; + +// The graph type is determined by the vertex and edge data types +typedef graphlab::distributed_graph graph_type; + + +inline graph_type::vertex_type +get_other_vertex(graph_type::edge_type& edge, + const graph_type::vertex_type& vertex) { + return vertex.id() == edge.source().id()? 
edge.target() : edge.source(); +}; // end of get_other_vertex + +inline bool graph_loader(graph_type& graph, + const std::string& filename, + const std::string& line) { + ASSERT_FALSE(line.empty()); + namespace qi = boost::spirit::qi; + namespace ascii = boost::spirit::ascii; + namespace phoenix = boost::phoenix; + + graph_type::vertex_id_type source_id(-1), target_id(-1); + std::vector str_vec; + + if(boost::ends_with(filename,".edge")){ + const bool success = qi::phrase_parse + (line.begin(), line.end(), + // Begin grammar + ( + qi::ulong_[phoenix::ref(source_id) = qi::_1] >> -qi::char_(',') >> + qi::ulong_[phoenix::ref(target_id) = qi::_1] + ) + , + // End grammar + ascii::space); + + if(!success) return false; + graph.add_edge(source_id, target_id); + return true; // successful load + } + + + const bool success = qi::parse + (line.begin(), line.end(), + // Begin grammar + ( + qi::omit[qi::ulong_[phoenix::ref(source_id) = qi::_1] ]>> qi::omit[+qi::space] >> //-qi::char_(',') >> + +qi::alnum % qi::omit[+qi::space] + ), + str_vec); + + if(!success) return false; + + vertex_data_type v_data(str_vec); + graph.add_vertex(source_id, v_data); + return true; // successful load +} // end of graph_loader + + +class wsearch : + + public graphlab::ivertex_program { + +public: + + edge_dir_type gather_edges(icontext_type& context, + const vertex_type& vertex) const { + if(vertex.data().is_doc) return graphlab::NO_EDGES; + else return graphlab::ALL_EDGES; + } // end of Gather edges + + + int gather(icontext_type& context, + const vertex_type& vertex, edge_type& edge) const { + if(vertex.data().is_doc) return 0; + else { + int count = 0; + vertex_type doc = get_other_vertex(edge, vertex); + for(int i = 0; i < doc.data().str_vec.size(); i++) + if(doc.data().str_vec[i] == "google") count ++; + return count; + } + } + + void apply(icontext_type& context, vertex_type& vertex, + const gather_type& total) { + vertex.data().count = total; + } + + edge_dir_type 
scatter_edges(icontext_type& context, + const vertex_type& vertex) const { + return graphlab::NO_EDGES; + } + + void scatter(icontext_type& context, const vertex_type& vertex, + edge_type& edge) const { } + + void save(graphlab::oarchive& oarc) const { } + + void load(graphlab::iarchive& iarc) { } + +}; + +int map_count(const graph_type::vertex_type& v) { return v.data().count; } + +edge_data_type +signal_target(graphlab::omni_engine::icontext_type& context, + graph_type::edge_type edge) { + context.signal(edge.target()); + return graphlab::empty(); +} + +int main(int argc, char** argv) { + // Initialize control plain using mpi + graphlab::mpi_tools::init(argc, argv); + graphlab::distributed_control dc; + global_logger().set_log_level(LOG_INFO); + + // Parse command line options ----------------------------------------------- + graphlab::command_line_options clopts("Word serach application for data affinity."); + std::string graph_dir; + std::string exec_type = "synchronous"; + clopts.attach_option("graph", graph_dir, + "The graph file. If none is provided " + "then a toy graph will be created"); + clopts.add_positional("graph"); + clopts.attach_option("engine", exec_type, + "The engine type synchronous or asynchronous"); + clopts.attach_option("use_delta", USE_DELTA_CACHE, + "Use the delta cache to reduce time in gather."); + + if(!clopts.parse(argc, argv)) { + dc.cout() << "Error in parsing command line arguments." << std::endl; + return EXIT_FAILURE; + } + + + // Enable gather caching in the engine + clopts.get_engine_args().set_option("use_cache", USE_DELTA_CACHE); + + // Build the graph ---------------------------------------------------------- + dc.cout() << "Loading graph." << std::endl; + graphlab::timer timer; + graph_type graph(dc, clopts); + if (graph_dir.length() > 0) { // Load the graph from a file + graph.load(graph_dir, graph_loader); + } else { + clopts.print_description(); + return 0; + } + dc.cout() << "Loading graph. 
Finished in " + << timer.current_time() << std::endl; + + size_t bytes_sent = dc.bytes_sent(); + size_t calls_sent = dc.calls_sent(); + size_t network_bytes_sent = dc.network_bytes_sent(); + size_t bytes_received = dc.bytes_received(); + size_t calls_received = dc.calls_received(); + dc.cout() << "load_Bytes_Sent: " << bytes_sent << std::endl; + dc.cout() << "load_Calls_Sent: " << calls_sent << std::endl; + dc.cout() << "load_Network_Sent: " << network_bytes_sent<< std::endl; + dc.cout() << "load_Bytes_Received: " << bytes_received << std::endl; + dc.cout() << "load_Calls_Received: " << calls_received << std::endl; + + + // must call finalize before querying the graph + dc.cout() << "Finalizing graph." << std::endl; + timer.start(); + graph.finalize(); + dc.cout() << "Finalizing graph. Finished in " + << timer.current_time() << std::endl; + + bytes_sent = dc.bytes_sent() - bytes_sent; + calls_sent = dc.calls_sent() - calls_sent; + network_bytes_sent = dc.network_bytes_sent() - network_bytes_sent; + bytes_received = dc.bytes_received() - bytes_received; + calls_received = dc.calls_received() - calls_received; + dc.cout() << "finalize_Bytes_Sent: " << bytes_sent << std::endl; + dc.cout() << "finalize_Calls_Sent: " << calls_sent << std::endl; + dc.cout() << "finalize_Network_Sent: " << network_bytes_sent<< std::endl; + dc.cout() << "finalize_Bytes_Received: " << bytes_received << std::endl; + dc.cout() << "finalize_Calls_Received: " << calls_received << std::endl; + + + dc.cout() << "#vertices: " << graph.num_vertices() + << " #edges:" << graph.num_edges() << std::endl; + + + // Running The Engine ------------------------------------------------------- + graphlab::omni_engine engine(dc, graph, exec_type, clopts); + + // Initialize the vertex data + engine.map_reduce_edges(signal_target); + + bytes_sent = dc.bytes_sent() - bytes_sent; + calls_sent = dc.calls_sent() - calls_sent; + network_bytes_sent = dc.network_bytes_sent() - network_bytes_sent; + bytes_received = 
dc.bytes_received() - bytes_received; + calls_received = dc.calls_received() - calls_received; + dc.cout() << "before_start_Bytes_Sent: " << bytes_sent << std::endl; + dc.cout() << "before_start_Calls_Sent: " << calls_sent << std::endl; + dc.cout() << "before_start_Network_Sent: " << network_bytes_sent<< std::endl; + dc.cout() << "before_start_Bytes_Received: " << bytes_received << std::endl; + dc.cout() << "before_start_Calls_Received: " << calls_received << std::endl; + + + //engine.signal_all(); + timer.start(); + engine.start(); + const double runtime = timer.current_time(); + dc.cout() << "----------------------------------------------------------" + << std::endl + << "Final Runtime (seconds): " << runtime + << std::endl + << "Updates executed: " << engine.num_updates() << std::endl + << "Update Rate (updates/second): " + << engine.num_updates() / runtime << std::endl; + + const int total_count = graph.map_reduce_vertices(map_count); + std::cout << "Total count: " << total_count << std::endl; + + bytes_sent = dc.bytes_sent() - bytes_sent; + calls_sent = dc.calls_sent() - calls_sent; + network_bytes_sent = dc.network_bytes_sent() - network_bytes_sent; + bytes_received = dc.bytes_received() - bytes_received; + calls_received = dc.calls_received() - calls_received; + dc.cout() << "compute_Bytes_Sent: " << bytes_sent << std::endl; + dc.cout() << "compute_Calls_Sent: " << calls_sent << std::endl; + dc.cout() << "compute_Network_Sent: " << network_bytes_sent<< std::endl; + dc.cout() << "compute_Bytes_Received: " << bytes_received << std::endl; + dc.cout() << "compute_Calls_Received: " << calls_received << std::endl; + + // Tear-down communication layer and quit ----------------------------------- + graphlab::mpi_tools::finalize(); + return EXIT_SUCCESS; +} // End of main + + +// We render this entire program in the documentation + + diff --git a/images/hybrid_cut.png b/images/hybrid_cut.png new file mode 100755 index 0000000000..10acccb7ce Binary files /dev/null and 
b/images/hybrid_cut.png differ diff --git a/images/hybrid_engine.png b/images/hybrid_engine.png new file mode 100755 index 0000000000..6b2fa17ae0 Binary files /dev/null and b/images/hybrid_engine.png differ diff --git a/license/LICENSE_prepend.txt b/license/LICENSE_prepend.txt index 55c9bf7589..be781ca9b8 100644 --- a/license/LICENSE_prepend.txt +++ b/license/LICENSE_prepend.txt @@ -1,3 +1,25 @@ +/* + * Copyright (c) 2013 Shanghai Jiao Tong University. + * All rights reserved. + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an "AS + * IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either + * express or implied. See the License for the specific language + * governing permissions and limitations under the License. + * + * For more about this software visit: + * + * http://ipads.se.sjtu.edu.cn/projects/powerlyra.html + * + */ + /* * Copyright (c) 2009 Carnegie Mellon University. * All rights reserved. 
diff --git a/src/graphlab/engine/async_consistent_engine.hpp b/src/graphlab/engine/async_consistent_engine.hpp index 4e169996dc..1b533154e8 100644 --- a/src/graphlab/engine/async_consistent_engine.hpp +++ b/src/graphlab/engine/async_consistent_engine.hpp @@ -831,7 +831,10 @@ namespace graphlab { logstream(LOG_DEBUG) << rmi.procid() << "-" << threadid << ": " << "\tTermination Double Checked" << std::endl; - if (!endgame_mode) logstream(LOG_EMPH) << "Endgame mode\n"; + if (!endgame_mode) + logstream(LOG_EMPH) << rmi.procid() << " Endgame mode " + << (timer::approx_time_seconds() - engine_start_time) + << std::endl; endgame_mode = true; // put everyone in endgame for (procid_t i = 0;i < rmi.dc().numprocs(); ++i) { @@ -1012,7 +1015,7 @@ namespace graphlab { const typename graph_type::vertex_record& rec = graph.l_get_vertex_record(lvid); vertex_id_type vid = rec.gvid; char task_time_data[sizeof(timer)]; - timer* task_time; + timer* task_time = NULL; if (track_task_time) { // placement new to create the timer task_time = reinterpret_cast(task_time_data); diff --git a/src/graphlab/engine/omni_engine.hpp b/src/graphlab/engine/omni_engine.hpp index 1f38a1f78f..ccc0ae5ac0 100644 --- a/src/graphlab/engine/omni_engine.hpp +++ b/src/graphlab/engine/omni_engine.hpp @@ -1,3 +1,29 @@ +/* + * Copyright (c) 2013 Shanghai Jiao Tong University. + * All rights reserved. + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an "AS + * IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either + * express or implied. See the License for the specific language + * governing permissions and limitations under the License. 
+ * + * For more about this software visit: + * + * http://ipads.se.sjtu.edu.cn/projects/powerlyra.html + * + * + * 2014.04 add calling to asynchronous engine of powerlyra + * 2013.11 add calling to synchronous engine of powerlyra + * + */ + /** * Copyright (c) 2009 Carnegie Mellon University. * All rights reserved. @@ -31,6 +57,8 @@ #include #include #include +#include +#include namespace graphlab { @@ -136,6 +164,16 @@ namespace graphlab { */ typedef async_consistent_engine async_consistent_engine_type; + /** + * \brief the type of powerlyra synchronous engine + */ + typedef powerlyra_sync_engine powerlyra_sync_engine_type; + + /** + * \brief the type of asynchronous engine + */ + typedef powerlyra_async_engine powerlyra_async_engine_type; + private: @@ -193,6 +231,12 @@ namespace graphlab { } else if(engine_type == "async" || engine_type == "asynchronous") { logstream(LOG_INFO) << "Using the Asynchronous engine." << std::endl; engine_ptr = new async_consistent_engine_type(dc, graph, new_options); + } else if(engine_type == "plsync" || engine_type == "powerlyra_synchronous") { + logstream(LOG_INFO) << "Using the PowerLyra Synchronous engine." << std::endl; + engine_ptr = new powerlyra_sync_engine_type(dc, graph, new_options); + } else if(engine_type == "plasync" || engine_type == "powerlyra_asynchronous") { + logstream(LOG_INFO) << "Using the PowerLyra Asynchronous engine." << std::endl; + engine_ptr = new powerlyra_async_engine_type(dc, graph, new_options); } else { logstream(LOG_FATAL) << "Invalid engine type: " << engine_type << std::endl; } diff --git a/src/graphlab/engine/powerlyra_async_engine.hpp b/src/graphlab/engine/powerlyra_async_engine.hpp new file mode 100755 index 0000000000..6ba80f0b48 --- /dev/null +++ b/src/graphlab/engine/powerlyra_async_engine.hpp @@ -0,0 +1,1322 @@ +/* + * Copyright (c) 2013 Shanghai Jiao Tong University. + * All rights reserved. 
+ * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an "AS + * IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either + * express or implied. See the License for the specific language + * governing permissions and limitations under the License. + * + * For more about this software visit: + * + * http://ipads.se.sjtu.edu.cn/projects/powerlyra.html + * + * + * 2014.04 implement asynchronous engine of powerlyra + * + */ + + + +#ifndef GRAPHLAB_POWERLYRA_ASYNC_ENGINE_HPP +#define GRAPHLAB_POWERLYRA_ASYNC_ENGINE_HPP + +#include +#include + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + + + +namespace graphlab { + + + /** + * \ingroup engines + * + * \brief The asynchronous consistent engine executed vertex programs + * asynchronously and can ensure mutual exclusion such that adjacent vertices + * are never executed simultaneously. The default mode is "factorized" + * consistency in which only individual gathers/applys/ + * scatters are guaranteed to be consistent, but this can be strengthened to + * provide full mutual exclusion. + * + * + * \tparam VertexProgram + * The user defined vertex program type which should implement the + * \ref graphlab::ivertex_program interface. + * + * ### Execution Semantics + * + * On start() the \ref graphlab::ivertex_program::init function is invoked + * on all vertex programs in parallel to initialize the vertex program, + * vertex data, and possibly signal vertices. 
+ * + * After which, the engine spawns a collection of threads where each thread + * individually performs the following tasks: + * \li Extract a message from the scheduler. + * \li Perform distributed lock acquisition on the vertex which is supposed + * to receive the message. The lock system enforces that no neighboring + * vertex is executing at the same time. The implementation is based + * on the Chandy-Misra solution to the dining philosophers problem. + * (Chandy, K.M.; Misra, J. (1984). The Drinking Philosophers Problem. + * ACM Trans. Program. Lang. Syst) + * \li Once lock acquisition is complete, + * \ref graphlab::ivertex_program::init is called on the vertex + * program. As an optimization, any messages sent to this vertex + * before completion of lock acquisition is merged into original message + * extracted from the scheduler. + * \li Execute the gather on the vertex program by invoking + * the user defined \ref graphlab::ivertex_program::gather function + * on the edge direction returned by the + * \ref graphlab::ivertex_program::gather_edges function. The gather + * functions can modify edge data but cannot modify the vertex + * program or vertex data and can be executed on multiple + * edges in parallel. + * * \li Execute the apply function on the vertex-program by + * invoking the user defined \ref graphlab::ivertex_program::apply + * function passing the sum of the gather functions. If \ref + * graphlab::ivertex_program::gather_edges returns no edges then + * the default gather value is passed to apply. The apply function + * can modify the vertex program and vertex data. + * \li Execute the scatter on the vertex program by invoking + * the user defined \ref graphlab::ivertex_program::scatter function + * on the edge direction returned by the + * \ref graphlab::ivertex_program::scatter_edges function. 
The scatter + * functions can modify edge data but cannot modify the vertex + * program or vertex data and can be executed on multiple + * edges in parallel. + * \li Release all locks acquired in the lock acquisition stage, + * and repeat until the scheduler is empty. + * + * The engine threads multiplexes the above procedure through a secondary + * internal queue, allowing an arbitrary large number of vertices to + * begin processing at the same time. + * + * ### Construction + * + * The asynchronous consistent engine is constructed by passing in a + * \ref graphlab::distributed_control object which manages coordination + * between engine threads and a \ref graphlab::distributed_graph object + * which is the graph on which the engine should be run. The graph should + * already be populated and cannot change after the engine is constructed. + * In the distributed setting all program instances (running on each machine) + * should construct an instance of the engine at the same time. + * + * Computation is initiated by signaling vertices using either + * \ref graphlab::powerlyra_async_engine::signal or + * \ref graphlab::powerlyra_async_engine::signal_all. In either case all + * machines should invoke signal or signal all at the same time. Finally, + * computation is initiated by calling the + * \ref graphlab::powerlyra_async_engine::start function. 
+ * + * ### Example Usage + * + * The following is a simple example demonstrating how to use the engine: + * \code + * #include + * + * struct vertex_data { + * // code + * }; + * struct edge_data { + * // code + * }; + * typedef graphlab::distributed_graph graph_type; + * typedef float gather_type; + * struct pagerank_vprog : + * public graphlab::ivertex_program { + * // code + * }; + * + * int main(int argc, char** argv) { + * // Initialize control plain using mpi + * graphlab::mpi_tools::init(argc, argv); + * graphlab::distributed_control dc; + * // Parse command line options + * graphlab::command_line_options clopts("PageRank algorithm."); + * std::string graph_dir; + * clopts.attach_option("graph", &graph_dir, graph_dir, + * "The graph file."); + * if(!clopts.parse(argc, argv)) { + * std::cout << "Error in parsing arguments." << std::endl; + * return EXIT_FAILURE; + * } + * graph_type graph(dc, clopts); + * graph.load_structure(graph_dir, "tsv"); + * graph.finalize(); + * std::cout << "#vertices: " << graph.num_vertices() + * << " #edges:" << graph.num_edges() << std::endl; + * graphlab::powerlyra_async_engine engine(dc, graph, clopts); + * engine.signal_all(); + * engine.start(); + * std::cout << "Runtime: " << engine.elapsed_seconds(); + * graphlab::mpi_tools::finalize(); + * } + * \endcode + * + * \see graphlab::omni_engine + * \see graphlab::synchronous_engine + * + * Engine Options + * ========================= + * The asynchronous engine supports several engine options which can + * be set as command line arguments using \c --engine_opts : + * + * \li \b timeout (default: infinity) Maximum time in seconds the engine will + * run for. The actual runtime may be marginally greater as the engine + * waits for all threads and processes to flush all active tasks before + * returning. 
+ * \li \b factorized (default: true) Set to true to weaken the consistency + * model to factorized consistency where only individual gather/apply/scatter + * calls are guaranteed to be locally consistent. Can produce massive + * increases in throughput at a consistency penalty. + * \li \b nfibers (default: 10000) Number of fibers to use + * \li \b stacksize (default: 16384) Stacksize of each fiber. + */ + template + class powerlyra_async_engine: public iengine { + + public: + /** + * \brief The user defined vertex program type. Equivalent to the + * VertexProgram template argument. + * + * The user defined vertex program type which should implement the + * \ref graphlab::ivertex_program interface. + */ + typedef VertexProgram vertex_program_type; + + /** + * \brief The user defined type returned by the gather function. + * + * The gather type is defined in the \ref graphlab::ivertex_program + * interface and is the value returned by the + * \ref graphlab::ivertex_program::gather function. The + * gather type must have an operator+=(const gather_type& + * other) function and must be \ref sec_serializable. + */ + typedef typename VertexProgram::gather_type gather_type; + + /** + * \brief The user defined message type used to signal neighboring + * vertex programs. + * + * The message type is defined in the \ref graphlab::ivertex_program + * interface and used in the call to \ref graphlab::icontext::signal. + * The message type must have an + * operator+=(const gather_type& other) function and + * must be \ref sec_serializable. + */ + typedef typename VertexProgram::message_type message_type; + + /** + * \brief The type of data associated with each vertex in the graph + * + * The vertex data type must be \ref sec_serializable. + */ + typedef typename VertexProgram::vertex_data_type vertex_data_type; + + /** + * \brief The type of data associated with each edge in the graph + * + * The edge data type must be \ref sec_serializable. 
+ */ + typedef typename VertexProgram::edge_data_type edge_data_type; + + /** + * \brief The type of graph supported by this vertex program + * + * See graphlab::distributed_graph + */ + typedef typename VertexProgram::graph_type graph_type; + + /** + * \brief The type used to represent a vertex in the graph. + * See \ref graphlab::distributed_graph::vertex_type for details + * + * The vertex type contains the function + * \ref graphlab::distributed_graph::vertex_type::data which + * returns a reference to the vertex data as well as other functions + * like \ref graphlab::distributed_graph::vertex_type::num_in_edges + * which returns the number of in edges. + * + */ + typedef typename graph_type::vertex_type vertex_type; + + /** + * \brief The type used to represent an edge in the graph. + * See \ref graphlab::distributed_graph::edge_type for details. + * + * The edge type contains the function + * \ref graphlab::distributed_graph::edge_type::data which returns a + * reference to the edge data. In addition the edge type contains + * the function \ref graphlab::distributed_graph::edge_type::source and + * \ref graphlab::distributed_graph::edge_type::target. + * + */ + typedef typename graph_type::edge_type edge_type; + + /** + * \brief The type of the callback interface passed by the engine to vertex + * programs. See \ref graphlab::icontext for details. + * + * The context callback is passed to the vertex program functions and is + * used to signal other vertices, get the current iteration, and access + * information about the engine. + */ + typedef icontext icontext_type; + + private: + /// \internal \brief The base type of all schedulers + message_array messages; + + /** \internal + * \brief The true type of the callback context interface which + * implements icontext. 
\see graphlab::icontext graphlab::context + */ + typedef context context_type; + + // context needs access to internal functions + friend class context; + + /// \internal \brief The type used to refer to vertices in the local graph + typedef typename graph_type::local_vertex_type local_vertex_type; + /// \internal \brief The type used to refer to edges in the local graph + typedef typename graph_type::local_edge_type local_edge_type; + /// \internal \brief The type used to refer to vertex IDs in the local graph + typedef typename graph_type::lvid_type lvid_type; + + /// \internal \brief The type of the current engine instantiation + typedef powerlyra_async_engine engine_type; + + typedef conditional_addition_wrapper conditional_gather_type; + + /// The RPC interface + dc_dist_object > rmi; + + /// A reference to the active graph + graph_type& graph; + + /// A pointer to the lock implementation + distributed_chandy_misra* cmlocks; + + /// Per vertex data locks + std::vector vertexlocks; + + /// Total update function completion time + std::vector total_completion_time; + + /** + * \brief This optional vector contains caches of previous gather + * contributions for each machine. + * + * Caching is done locally and therefore a high-degree vertex may + * have multiple caches (one per machine). + */ + std::vector gather_cache; + + /** + * \brief A bit indicating if the local gather for that vertex is + * available. + */ + dense_bitset has_cache; + + bool use_cache; + + /// Engine threads. + fiber_group thrgroup; + + //! 
The scheduler + ischeduler* scheduler_ptr; + + typedef typename iengine::aggregator_type aggregator_type; + aggregator_type aggregator; + + /// Number of kernel threads + size_t ncpus; + /// Size of each fiber stack + size_t stacksize; + /// Number of fibers + size_t nfibers; + /// set to true if engine is started + bool started; + + bool track_task_time; + /// A pointer to the distributed consensus object + fiber_async_consensus* consensus; + + /** + * Used only by the locking subsystem. + * to allow the fiber to go to sleep when waiting for the locks to + * be ready. + */ + struct vertex_fiber_cm_handle { + mutex lock; + bool philosopher_ready; + size_t fiber_handle; + }; + std::vector cm_handles; + + dense_bitset program_running; + dense_bitset hasnext; + + // Various counters. + atomic programs_executed; + + timer launch_timer; + + /// Defaults to (-1), defines a timeout + size_t timed_termination; + + /// engine option. Sets to true if factorized consistency is used + bool factorized_consistency; + + bool endgame_mode; + + /// The number of try_to_quit + long nttqs; + + /// Time when engine is started + float engine_start_time; + + /// True when a force stop is triggered (possibly via a timeout) + bool force_stop; + + graphlab_options opts_copy; // local copy of options to pass to + // scheduler construction + + execution_status::status_enum termination_reason; + + std::vector aggregation_lock; + std::vector > aggregation_queue; + public: + + /** + * Constructs an asynchronous consistent distributed engine. + * The number of threads to create are read from + * \ref graphlab_options::get_ncpus "opts.get_ncpus()". The scheduler to + * construct is read from + * \ref graphlab_options::get_scheduler_type() "opts.get_scheduler_type()". + * The default scheduler + * is the queued_fifo scheduler. For details on the scheduler types + * \see scheduler_types + * + * See the main class documentation for the + * available engine options. 
+ *
+ * \param dc Distributed controller to associate with
+ * \param graph The graph to schedule over. The graph must be fully
+ * constructed and finalized.
+ * \param opts A graphlab::graphlab_options object containing options and
+ * parameters for the scheduler and the engine.
+ */
+ powerlyra_async_engine(distributed_control &dc,
+ graph_type& graph,
+ const graphlab_options& opts = graphlab_options()) :
+ rmi(dc, this), graph(graph), scheduler_ptr(NULL),
+ aggregator(dc, graph, new context_type(*this, graph)), started(false),
+ engine_start_time(timer::approx_time_seconds()), force_stop(false) {
+ rmi.barrier();
+
+ // Default engine parameters; any of these may be overridden through
+ // the engine options processed in set_options() below.
+ nfibers = 10000;
+ stacksize = 16384;
+ use_cache = false;
+ factorized_consistency = true;
+ track_task_time = false;
+ // endgame_mode was previously left uninitialized until start() cleared it,
+ // but internal_signal() reads it as soon as started is set; initialize it here.
+ endgame_mode = false;
+ timed_termination = (size_t)(-1);
+ termination_reason = execution_status::UNSET;
+ set_options(opts);
+ total_completion_time.resize(fiber_control::get_instance().num_workers());
+ // NOTE(review): init() used to be invoked twice here (before and after the
+ // resize above); a single initialization of the scheduling datastructures
+ // suffices and avoids redundant graph.finalize()/resize work.
+ init();
+ rmi.barrier();
+ }
+
+ private:
+
+ /**
+ * \internal
+ * Configures the engine with the provided options.
+ * The number of threads to create are read from
+ * opts::get_ncpus(). The scheduler to construct is read from
+ * graphlab_options::get_scheduler_type(). The default scheduler
+ * is the queued_fifo scheduler. 
For details on the scheduler types
+ * \see scheduler_types
+ */
+ void set_options(const graphlab_options& opts) {
+ rmi.barrier();
+ ncpus = opts.get_ncpus();
+ ASSERT_GT(ncpus, 0);
+ // one aggregation queue (and one guarding lock) per kernel thread
+ aggregation_lock.resize(opts.get_ncpus());
+ aggregation_queue.resize(opts.get_ncpus());
+ std::vector keys = opts.get_engine_args().get_option_keys();
+ foreach(std::string opt, keys) {
+ if (opt == "timeout") {
+ opts.get_engine_args().get_option("timeout", timed_termination);
+ if (rmi.procid() == 0)
+ logstream(LOG_EMPH) << "Engine Option: timeout = " << timed_termination << std::endl;
+ } else if (opt == "factorized") {
+ opts.get_engine_args().get_option("factorized", factorized_consistency);
+ if (rmi.procid() == 0)
+ logstream(LOG_EMPH) << "Engine Option: factorized = " << factorized_consistency << std::endl;
+ } else if (opt == "nfibers") {
+ opts.get_engine_args().get_option("nfibers", nfibers);
+ if (rmi.procid() == 0)
+ logstream(LOG_EMPH) << "Engine Option: nfibers = " << nfibers << std::endl;
+ } else if (opt == "track_task_time") {
+ opts.get_engine_args().get_option("track_task_time", track_task_time);
+ if (rmi.procid() == 0)
+ logstream(LOG_EMPH) << "Engine Option: track_task_time = " << track_task_time<< std::endl;
+ }else if (opt == "stacksize") {
+ opts.get_engine_args().get_option("stacksize", stacksize);
+ if (rmi.procid() == 0)
+ logstream(LOG_EMPH) << "Engine Option: stacksize= " << stacksize << std::endl;
+ } else if (opt == "use_cache") {
+ opts.get_engine_args().get_option("use_cache", use_cache);
+ if (rmi.procid() == 0)
+ logstream(LOG_EMPH) << "Engine Option: use_cache = " << use_cache << std::endl;
+ } else {
+ // unknown engine options are fatal: fail fast rather than silently ignore
+ logstream(LOG_FATAL) << "Unexpected Engine Option: " << opt << std::endl;
+ }
+ }
+ opts_copy = opts;
+ // set a default scheduler if none
+ if (opts_copy.get_scheduler_type() == "") {
+ opts_copy.set_scheduler_type("queued_fifo");
+ }
+
+ // construct scheduler passing in the copy of the options from set_options
+ scheduler_ptr = scheduler_factory::
+ new_scheduler(graph.num_local_vertices(),
+ opts_copy);
+ rmi.barrier();
+
+ // create initial fork arrangement based on the alternate vid mapping
+ // (the distributed Chandy-Misra locks are only constructed when full
+ // consistency is requested; under factorized consistency cmlocks stays NULL
+ // and fine-grained per-vertex locks are used instead)
+ if (factorized_consistency == false) {
+ cmlocks = new distributed_chandy_misra(rmi.dc(), graph,
+ boost::bind(&engine_type::lock_ready, this, _1));
+
+ }
+ else {
+ cmlocks = NULL;
+ }
+
+ // construct the termination consensus object
+ consensus = new fiber_async_consensus(rmi.dc(), nfibers);
+ }
+
+ /**
+ * \internal
+ * Initializes the engine with respect to the associated graph.
+ * This call will initialize all internal and scheduling datastructures.
+ * This function must be called prior to any signal function.
+ */
+ void init() {
+ // construct all the required datastructures
+ // deinitialize performs the reverse
+ graph.finalize();
+ scheduler_ptr->set_num_vertices(graph.num_local_vertices());
+ messages.resize(graph.num_local_vertices());
+ vertexlocks.resize(graph.num_local_vertices());
+ program_running.resize(graph.num_local_vertices());
+ hasnext.resize(graph.num_local_vertices());
+ if (use_cache) {
+ // per-local-vertex gather cache plus a bit marking which entries are valid
+ gather_cache.resize(graph.num_local_vertices(), gather_type());
+ has_cache.resize(graph.num_local_vertices());
+ has_cache.clear();
+ }
+ if (!factorized_consistency) {
+ // per-vertex handles used to park fibers while distributed locks are acquired
+ cm_handles.resize(graph.num_local_vertices());
+ }
+ rmi.barrier();
+ }
+
+
+
+ public:
+ ~powerlyra_async_engine() {
+ // cmlocks may be NULL under factorized consistency; delete of NULL is a no-op
+ delete consensus;
+ delete cmlocks;
+ delete scheduler_ptr;
+ }
+
+
+
+
+ // documentation inherited from iengine
+ size_t num_updates() const {
+ return programs_executed.value;
+ }
+
+
+
+
+
+ // documentation inherited from iengine
+ float elapsed_seconds() const {
+ return timer::approx_time_seconds() - engine_start_time;
+ }
+
+
+ /**
+ * \brief Not meaningful for the asynchronous engine. Returns -1. 
+ */ + int iteration() const { return -1; } + + +/************************************************************************** + * Signaling Interface * + **************************************************************************/ + + private: + + /** + * \internal + * This is used to receive a message forwarded from another machine + */ + void rpc_signal(vertex_id_type vid, + const message_type& message) { + if (force_stop) return; + const lvid_type local_vid = graph.local_vid(vid); + double priority; + messages.add(local_vid, message, &priority); + scheduler_ptr->schedule(local_vid, priority); + consensus->cancel(); + } + + /** + * \internal + * \brief Signals a vertex with an optional message + * + * Signals a vertex, and schedules it to be executed in the future. + * must be called on a vertex accessible by the current machine. + */ + void internal_signal(const vertex_type& vtx, + const message_type& message = message_type()) { + if (force_stop) return; + if (started) { + const typename graph_type::vertex_record& rec = graph.l_get_vertex_record(vtx.local_id()); + const procid_t owner = rec.owner; + if (endgame_mode) { + // fast signal. 
push to the remote machine immediately + if (owner != rmi.procid()) { + const vertex_id_type vid = rec.gvid; + rmi.remote_call(owner, &engine_type::rpc_signal, vid, message); + } + else { + double priority; + messages.add(vtx.local_id(), message, &priority); + scheduler_ptr->schedule(vtx.local_id(), priority); + consensus->cancel(); + } + } + else { + + double priority; + messages.add(vtx.local_id(), message, &priority); + scheduler_ptr->schedule(vtx.local_id(), priority); + consensus->cancel(); + } + } + else { + double priority; + messages.add(vtx.local_id(), message, &priority); + scheduler_ptr->schedule(vtx.local_id(), priority); + consensus->cancel(); + } + } // end of schedule + + + /** + * \internal + * \brief Signals a vertex with an optional message + * + * Signals a global vid, and schedules it to be executed in the future. + * If current machine does not contain the vertex, it is ignored. + */ + void internal_signal_gvid(vertex_id_type gvid, + const message_type& message = message_type()) { + if (force_stop) return; + if (graph.is_master(gvid)) { + internal_signal(graph.vertex(gvid), message); + } else { + procid_t proc = graph.master(gvid); + rmi.remote_call(proc, &powerlyra_async_engine::internal_signal_gvid, + gvid, message); + } + } + + + void rpc_internal_stop() { + force_stop = true; + termination_reason = execution_status::FORCED_ABORT; + } + + /** + * \brief Force engine to terminate immediately. + * + * This function is used to stop the engine execution by forcing + * immediate termination. + */ + void internal_stop() { + for (procid_t i = 0;i < rmi.numprocs(); ++i) { + rmi.remote_call(i, &powerlyra_async_engine::rpc_internal_stop); + } + } + + + + /** + * \brief Post a to a previous gather for a give vertex. + * + * This function is called by the \ref graphlab::context. 
+ * + * @param [in] vertex The vertex to which to post a change in the sum + * @param [in] delta The change in that sum + */ + void internal_post_delta(const vertex_type& vertex, + const gather_type& delta) { + if(use_cache) { + const lvid_type lvid = vertex.local_id(); + vertexlocks[lvid].lock(); + if( has_cache.get(lvid) ) { + gather_cache[lvid] += delta; + } else { + // You cannot add a delta to an empty cache. A complete + // gather must have been run. + // gather_cache[lvid] = delta; + // has_cache.set_bit(lvid); + } + vertexlocks[lvid].unlock(); + } + } + + /** + * \brief Clear the cached gather for a vertex if one is + * available. + * + * This function is called by the \ref graphlab::context. + * + * @param [in] vertex the vertex for which to clear the cache + */ + void internal_clear_gather_cache(const vertex_type& vertex) { + const lvid_type lvid = vertex.local_id(); + if(use_cache && has_cache.get(lvid)) { + vertexlocks[lvid].lock(); + gather_cache[lvid] = gather_type(); + has_cache.clear_bit(lvid); + vertexlocks[lvid].unlock(); + } + + } + + public: + + + + void signal(vertex_id_type gvid, + const message_type& message = message_type()) { + rmi.barrier(); + internal_signal_gvid(gvid, message); + rmi.barrier(); + } + + + void signal_all(const message_type& message = message_type(), + const std::string& order = "shuffle") { + vertex_set vset = graph.complete_set(); + signal_vset(vset, message, order); + } // end of schedule all + + void signal_vset(const vertex_set& vset, + const message_type& message = message_type(), + const std::string& order = "shuffle") { + logstream(LOG_DEBUG) << rmi.procid() << ": Schedule All" << std::endl; + // allocate a vector with all the local owned vertices + // and schedule all of them. 
+ std::vector vtxs; + vtxs.reserve(graph.num_local_own_vertices()); + for(lvid_type lvid = 0; + lvid < graph.get_local_graph().num_vertices(); + ++lvid) { + if (graph.l_vertex(lvid).owner() == rmi.procid() && + vset.l_contains(lvid)) { + vtxs.push_back(lvid); + } + } + + if(order == "shuffle") { + graphlab::random::shuffle(vtxs.begin(), vtxs.end()); + } + foreach(lvid_type lvid, vtxs) { + double priority; + messages.add(lvid, message, &priority); + scheduler_ptr->schedule(lvid, priority); + } + rmi.barrier(); + } + + + private: + + /** + * Gets a task from the scheduler and the associated message + */ + sched_status::status_enum get_next_sched_task( size_t threadid, + lvid_type& lvid, + message_type& msg) { + while (1) { + sched_status::status_enum stat = + scheduler_ptr->get_next(threadid % ncpus, lvid); + if (stat == sched_status::NEW_TASK) { + if (messages.get(lvid, msg)) return stat; + else continue; + } + return stat; + } + } + + void set_endgame_mode() { + if (!endgame_mode) logstream(LOG_EMPH) << "Endgame mode\n"; + endgame_mode = true; + rmi.dc().set_fast_track_requests(true); + } + + /** + * \internal + * Called when get_a_task returns no internal task not a scheduler task. + * This rechecks the status of the internal task queue and the scheduler + * inside a consensus critical section. 
+ */ + bool try_to_quit(size_t threadid, + bool& has_sched_msg, + lvid_type& sched_lvid, + message_type &msg) { + if (timer::approx_time_seconds() - engine_start_time > timed_termination) { + termination_reason = execution_status::TIMEOUT; + force_stop = true; + } + fiber_control::yield(); + + nttqs ++; + logstream(LOG_DEBUG) << rmi.procid() << "-" << threadid << ": " << "Termination Attempt " << std::endl; + has_sched_msg = false; + consensus->begin_done_critical_section(threadid); + sched_status::status_enum stat = + get_next_sched_task(threadid, sched_lvid, msg); + if (stat == sched_status::EMPTY || force_stop) { + logstream(LOG_DEBUG) << rmi.procid() << "-" << threadid << ": " + << "\tTermination Double Checked" << std::endl; + + if (!endgame_mode) + logstream(LOG_EMPH) << rmi.procid() << " Endgame mode " + << (timer::approx_time_seconds() - engine_start_time) + << std::endl; + endgame_mode = true; + // put everyone in endgame + for (procid_t i = 0;i < rmi.dc().numprocs(); ++i) { + rmi.remote_call(i, &powerlyra_async_engine::set_endgame_mode); + } + bool ret = consensus->end_done_critical_section(threadid); + if (ret == false) { + logstream(LOG_DEBUG) << rmi.procid() << "-" << threadid << ": " + << "\tCancelled" << std::endl; + } else { + logstream(LOG_DEBUG) << rmi.procid() << "-" << threadid << ": " + << "\tDying" << " (" << fiber_control::get_tid() << ")" << std::endl; + } + return ret; + } else { + logstream(LOG_DEBUG) << rmi.procid() << "-" << threadid << ": " + << "\tCancelled by Scheduler Task" << std::endl; + consensus->cancel_critical_section(threadid); + has_sched_msg = true; + return false; + } + } // end of try to quit + + + /** + * \internal + * When all distributed locks are acquired, this function is called + * from the chandy misra implementation on the master vertex. 
+ * Here, we perform initialization + * of the task and switch the vertex to a gathering state + */ + void lock_ready(lvid_type lvid) { + cm_handles[lvid]->lock.lock(); + cm_handles[lvid]->philosopher_ready = true; + fiber_control::schedule_tid(cm_handles[lvid]->fiber_handle); + cm_handles[lvid]->lock.unlock(); + } + + + conditional_gather_type perform_gather(vertex_id_type vid, + vertex_program_type& vprog_) { + vertex_program_type vprog = vprog_; + lvid_type lvid = graph.local_vid(vid); + local_vertex_type local_vertex(graph.l_vertex(lvid)); + vertex_type vertex(local_vertex); + context_type context(*this, graph); + edge_dir_type gather_dir = vprog.gather_edges(context, vertex); + conditional_gather_type accum; + + //check against the cache + if( use_cache && has_cache.get(lvid) ) { + accum.set(gather_cache[lvid]); + return accum; + } + // do in edges + if(gather_dir == IN_EDGES || gather_dir == ALL_EDGES) { + foreach(local_edge_type local_edge, local_vertex.in_edges()) { + edge_type edge(local_edge); + lvid_type a = edge.source().local_id(), b = edge.target().local_id(); + vertexlocks[std::min(a,b)].lock(); + vertexlocks[std::max(a,b)].lock(); + accum += vprog.gather(context, vertex, edge); + vertexlocks[a].unlock(); + vertexlocks[b].unlock(); + } + } + // do out edges + if(gather_dir == OUT_EDGES || gather_dir == ALL_EDGES) { + foreach(local_edge_type local_edge, local_vertex.out_edges()) { + edge_type edge(local_edge); + lvid_type a = edge.source().local_id(), b = edge.target().local_id(); + vertexlocks[std::min(a,b)].lock(); + vertexlocks[std::max(a,b)].lock(); + accum += vprog.gather(context, vertex, edge); + vertexlocks[a].unlock(); + vertexlocks[b].unlock(); + } + } + if (use_cache) { + gather_cache[lvid] = accum.value; has_cache.set_bit(lvid); + } + return accum; + } + + + void perform_scatter_local(lvid_type lvid, + vertex_program_type& vprog) { + local_vertex_type local_vertex(graph.l_vertex(lvid)); + vertex_type vertex(local_vertex); + context_type 
context(*this, graph); + edge_dir_type scatter_dir = vprog.scatter_edges(context, vertex); + if(scatter_dir == IN_EDGES || scatter_dir == ALL_EDGES) { + foreach(local_edge_type local_edge, local_vertex.in_edges()) { + edge_type edge(local_edge); + lvid_type a = edge.source().local_id(), b = edge.target().local_id(); + vertexlocks[std::min(a,b)].lock(); + vertexlocks[std::max(a,b)].lock(); + vprog.scatter(context, vertex, edge); + vertexlocks[a].unlock(); + vertexlocks[b].unlock(); + } + } + if(scatter_dir == OUT_EDGES || scatter_dir == ALL_EDGES) { + foreach(local_edge_type local_edge, local_vertex.out_edges()) { + edge_type edge(local_edge); + lvid_type a = edge.source().local_id(), b = edge.target().local_id(); + vertexlocks[std::min(a,b)].lock(); + vertexlocks[std::max(a,b)].lock(); + vprog.scatter(context, vertex, edge); + vertexlocks[a].unlock(); + vertexlocks[b].unlock(); + } + } + + // release locks + if (!factorized_consistency) { + cmlocks->philosopher_stops_eating_per_replica(lvid); + } + } + + + void perform_scatter(vertex_id_type vid, + vertex_program_type& vprog_, + const vertex_data_type& newdata) { + vertex_program_type vprog = vprog_; + lvid_type lvid = graph.local_vid(vid); + vertexlocks[lvid].lock(); + graph.l_vertex(lvid).data() = newdata; + vertexlocks[lvid].unlock(); + perform_scatter_local(lvid, vprog); + } + + + // make sure I am the only person running. + // if returns false, the message has been dropped into the message array. + // quit + bool get_exclusive_access_to_vertex(const lvid_type lvid, + const message_type& msg) { + vertexlocks[lvid].lock(); + bool someone_else_running = program_running.set_bit(lvid); + if (someone_else_running) { + // bad. someone else is here. + // drop it into the message array + messages.add(lvid, msg); + hasnext.set_bit(lvid); + } + vertexlocks[lvid].unlock(); + return !someone_else_running; + } + + + + // make sure I am the only person running. 
+ // if returns false, the message has been dropped into the message array. + // quit + void release_exclusive_access_to_vertex(const lvid_type lvid) { + vertexlocks[lvid].lock(); + // someone left a next message for me + // reschedule it at high priority + if (hasnext.get(lvid)) { + scheduler_ptr->schedule(lvid, 10000.0); + consensus->cancel(); + hasnext.clear_bit(lvid); + } + program_running.clear_bit(lvid); + vertexlocks[lvid].unlock(); + } + + bool high_lvid(const lvid_type lvid) { + return graph.l_degree_type(lvid) == graph_type::HIGH; + } + + /** + * \internal + * Called when the scheduler returns a vertex to run. + * If this function is called with vertex locks acquired, prelocked + * should be true. Otherwise it should be false. + */ + void eval_sched_task(const lvid_type lvid, + const message_type& msg) { + const typename graph_type::vertex_record& rec = graph.l_get_vertex_record(lvid); + vertex_id_type vid = rec.gvid; + char task_time_data[sizeof(timer)]; + timer* task_time = NULL; + if (track_task_time) { + // placement new to create the timer + task_time = reinterpret_cast(task_time_data); + new (task_time) timer(); + } + // if this is another machine's forward it + if (rec.owner != rmi.procid()) { + rmi.remote_call(rec.owner, &engine_type::rpc_signal, vid, msg); + return; + } + // I have to run this myself + + if (!get_exclusive_access_to_vertex(lvid, msg)) return; + + /**************************************************************************/ + /* Acquire Locks */ + /**************************************************************************/ + if (!factorized_consistency) { + // begin lock acquisition + cm_handles[lvid] = new vertex_fiber_cm_handle; + cm_handles[lvid]->philosopher_ready = false; + cm_handles[lvid]->fiber_handle = fiber_control::get_tid(); + cmlocks->make_philosopher_hungry(lvid); + cm_handles[lvid]->lock.lock(); + while (!cm_handles[lvid]->philosopher_ready) { + fiber_control::deschedule_self(&(cm_handles[lvid]->lock.m_mut)); + 
cm_handles[lvid]->lock.lock(); + } + cm_handles[lvid]->lock.unlock(); + } + + /**************************************************************************/ + /* Begin Program */ + /**************************************************************************/ + context_type context(*this, graph); + vertex_program_type vprog = vertex_program_type(); + local_vertex_type local_vertex(graph.l_vertex(lvid)); + vertex_type vertex(local_vertex); + bool high = high_lvid(lvid); + + /**************************************************************************/ + /* init phase */ + /**************************************************************************/ + vprog.init(context, vertex, msg); + + /**************************************************************************/ + /* Gather Phase */ + /**************************************************************************/ + conditional_gather_type gather_result; + std::vector > gather_futures; + edge_dir_type gather_dir = vprog.gather_edges(context, vertex); + + if (high || (gather_dir == graphlab::ALL_EDGES) + || (gather_dir == graphlab::OUT_EDGES)) { + foreach(procid_t mirror, local_vertex.mirrors()) { + gather_futures.push_back( + object_fiber_remote_request(rmi, + mirror, + &powerlyra_async_engine::perform_gather, + vid, + vprog)); + } + } + gather_result += perform_gather(vid, vprog); + if (high || (gather_dir == graphlab::ALL_EDGES) + || (gather_dir == graphlab::OUT_EDGES)) { + for(size_t i = 0;i < gather_futures.size(); ++i) { + gather_result += gather_futures[i](); + } + } + + /**************************************************************************/ + /* apply phase */ + /**************************************************************************/ + vertexlocks[lvid].lock(); + vprog.apply(context, vertex, gather_result.value); + vertexlocks[lvid].unlock(); + + + /**************************************************************************/ + /* scatter phase */ + 
/**************************************************************************/ + + // should I wait for the scatter? nah... but in case you want to + // the code is commented below + /*foreach(procid_t mirror, local_vertex.mirrors()) { + rmi.remote_call(mirror, + &powerlyra_async_engine::perform_scatter, + vid, + vprog, + local_vertex.data()); + }*/ + + std::vector > scatter_futures; + foreach(procid_t mirror, local_vertex.mirrors()) { + scatter_futures.push_back( + object_fiber_remote_request(rmi, + mirror, + &powerlyra_async_engine::perform_scatter, + vid, + vprog, + local_vertex.data())); + } + perform_scatter_local(lvid, vprog); + for(size_t i = 0;i < scatter_futures.size(); ++i) + scatter_futures[i](); + + /************************************************************************/ + /* Release Locks */ + /************************************************************************/ + // the scatter is used to release the chandy misra + // here I cleanup + if (!factorized_consistency) { + delete cm_handles[lvid]; + cm_handles[lvid] = NULL; + } + release_exclusive_access_to_vertex(lvid); + if (track_task_time) { + total_completion_time[fiber_control::get_worker_id()] += + task_time->current_time(); + task_time->~timer(); + } + programs_executed.inc(); + } + + + /** + * \internal + * Per thread main loop + */ + void thread_start(size_t threadid) { + bool has_sched_msg = false; + std::vector > internal_lvid; + lvid_type sched_lvid; + + message_type msg; + float last_aggregator_check = timer::approx_time_seconds(); + timer ti; ti.start(); + while(1) { + if (timer::approx_time_seconds() != last_aggregator_check && !endgame_mode) { + last_aggregator_check = timer::approx_time_seconds(); + std::string key = aggregator.tick_asynchronous(); + if (key != "") { + for (size_t i = 0;i < aggregation_lock.size(); ++i) { + aggregation_lock[i].lock(); + aggregation_queue[i].push_back(key); + aggregation_lock[i].unlock(); + } + } + } + + // test the aggregator + 
while(!aggregation_queue[fiber_control::get_worker_id()].empty()) { + size_t wid = fiber_control::get_worker_id(); + ASSERT_LT(wid, ncpus); + aggregation_lock[wid].lock(); + std::string key = aggregation_queue[wid].front(); + aggregation_queue[wid].pop_front(); + aggregation_lock[wid].unlock(); + aggregator.tick_asynchronous_compute(wid, key); + } + + sched_status::status_enum stat = get_next_sched_task(threadid, sched_lvid, msg); + + + has_sched_msg = stat != sched_status::EMPTY; + if (stat != sched_status::EMPTY) { + eval_sched_task(sched_lvid, msg); + if (endgame_mode) rmi.dc().flush(); + } + else if (!try_to_quit(threadid, has_sched_msg, sched_lvid, msg)) { + /* + * We failed to obtain a task, try to quit + */ + if (has_sched_msg) { + eval_sched_task(sched_lvid, msg); + } + } else { + break; + } + + if (fiber_control::worker_has_priority_fibers_on_queue()) { + fiber_control::yield(); + } + } + } // end of thread start + +/************************************************************************** + * Main engine start() * + **************************************************************************/ + + public: + + + /** + * \brief Start the engine execution. + * + * This function starts the engine and does not + * return until the scheduler has no tasks remaining. + * + * \return the reason for termination + */ + execution_status::status_enum start() { + bool old_fasttrack = rmi.dc().set_fast_track_requests(false); + logstream(LOG_INFO) << "Spawning " << nfibers << " threads" << std::endl; + ASSERT_TRUE(scheduler_ptr != NULL); + consensus->reset(); + + // now. 
It is of critical importance that we match the number of + // actual workers + + + // start the aggregator + aggregator.start(ncpus); + aggregator.aggregate_all_periodic(); + + started = true; + + rmi.barrier(); + size_t allocatedmem = memory_info::allocated_bytes(); + rmi.all_reduce(allocatedmem); + + engine_start_time = timer::approx_time_seconds(); + force_stop = false; + endgame_mode = false; + programs_executed = 0; + nttqs = 0; + launch_timer.start(); + + termination_reason = execution_status::RUNNING; + if (rmi.procid() == 0) { + logstream(LOG_INFO) << "Total Allocated Bytes: " << allocatedmem << std::endl; + } + thrgroup.set_stacksize(stacksize); + + size_t effncpus = std::min(ncpus, fiber_control::get_instance().num_workers()); + for (size_t i = 0; i < nfibers ; ++i) { + thrgroup.launch(boost::bind(&engine_type::thread_start, this, i), + i % effncpus); + } + thrgroup.join(); + aggregator.stop(); + // if termination reason was not changed, then it must be depletion + if (termination_reason == execution_status::RUNNING) { + termination_reason = execution_status::TASK_DEPLETION; + } + + size_t ctasks = programs_executed.value; + rmi.all_reduce(ctasks); + programs_executed.value = ctasks; + + logstream(LOG_INFO) << rmi.procid() << " #try_to_quit = " << nttqs + << std::endl; + + rmi.cout() << "Completed Tasks: " << programs_executed.value << std::endl; + + + size_t numjoins = messages.num_joins(); + rmi.all_reduce(numjoins); + rmi.cout() << "Schedule Joins: " << numjoins << std::endl; + + size_t numadds = messages.num_adds(); + rmi.all_reduce(numadds); + rmi.cout() << "Schedule Adds: " << numadds << std::endl; + + if (track_task_time) { + double total_task_time = 0; + for (size_t i = 0;i < total_completion_time.size(); ++i) { + total_task_time += total_completion_time[i]; + } + rmi.all_reduce(total_task_time); + rmi.cerr() << "Average Task Completion Time = " + << total_task_time / programs_executed.value << std::endl; + } + + + 
ASSERT_TRUE(scheduler_ptr->empty()); + started = false; + + rmi.dc().set_fast_track_requests(old_fasttrack); + return termination_reason; + } // end of start + + + public: + aggregator_type* get_aggregator() { return &aggregator; } + + }; // end of class +} // namespace + +#include + +#endif // GRAPHLAB_DISTRIBUTED_ENGINE_HPP + diff --git a/src/graphlab/engine/powerlyra_sync_engine.hpp b/src/graphlab/engine/powerlyra_sync_engine.hpp new file mode 100644 index 0000000000..c3ee00a472 --- /dev/null +++ b/src/graphlab/engine/powerlyra_sync_engine.hpp @@ -0,0 +1,2195 @@ +/* + * Copyright (c) 2013 Shanghai Jiao Tong University. + * All rights reserved. + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an "AS + * IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either + * express or implied. See the License for the specific language + * governing permissions and limitations under the License. + * + * For more about this software visit: + * + * http://ipads.se.sjtu.edu.cn/projects/powerlyra.html + * + * + * 2013.11 implement synchronous engine of powerlyra + * + */ + + + +#ifndef GRAPHLAB_POWERLYRA_SYNC_ENGINE_HPP +#define GRAPHLAB_POWERLYRA_SYNC_ENGINE_HPP + +#include +#include + +#include + +#include +#include +#include + +#include +#include + + + + +#include +#include +#include +#include +#include + +#include +#include +#include + + + + + + +#include + +#define TUNING +namespace graphlab { + + + /** + * \ingroup engines + * + * \brief The synchronous engine executes all active vertex program + * synchronously in a sequence of super-step (iterations) in both the + * shared and distributed memory settings. 
+ * + * \tparam VertexProgram The user defined vertex program which + * should implement the \ref graphlab::ivertex_program interface. + * + * + * ### Execution Semantics + * + * On start() the \ref graphlab::ivertex_program::init function is invoked + * on all vertex programs in parallel to initialize the vertex program, + * vertex data, and possibly signal vertices. + * The engine then proceeds to execute a sequence of + * super-steps (iterations) each of which is further decomposed into a + * sequence of minor-steps which are also executed synchronously: + * \li Receive all incoming messages (signals) by invoking the + * \ref graphlab::ivertex_program::init function on all + * vertex-programs that have incoming messages. If a + * vertex-program does not have any incoming messages then it is + * not active during this super-step. + * \li Execute all gathers for active vertex programs by invoking + * the user defined \ref graphlab::ivertex_program::gather function + * on the edge direction returned by the + * \ref graphlab::ivertex_program::gather_edges function. The gather + * functions can modify edge data but cannot modify the vertex + * program or vertex data and therefore can be executed on multiple + * edges in parallel. The gather type is used to accumulate (sum) + * the result of the gather function calls. + * \li Execute all apply functions for active vertex-programs by + * invoking the user defined \ref graphlab::ivertex_program::apply + * function passing the sum of the gather functions. If \ref + * graphlab::ivertex_program::gather_edges returns no edges then + * the default gather value is passed to apply. The apply function + * can modify the vertex program and vertex data. + * \li Execute all scatters for active vertex programs by invoking + * the user defined \ref graphlab::ivertex_program::scatter function + * on the edge direction returned by the + * \ref graphlab::ivertex_program::scatter_edges function. 
The scatter + * functions can modify edge data but cannot modify the vertex + * program or vertex data and therefore can be executed on multiple + * edges in parallel. + * + * ### Construction + * + * The synchronous engine is constructed by passing in a + * \ref graphlab::distributed_control object which manages coordination + * between engine threads and a \ref graphlab::distributed_graph object + * which is the graph on which the engine should be run. The graph should + * already be populated and cannot change after the engine is constructed. + * In the distributed setting all program instances (running on each machine) + * should construct an instance of the engine at the same time. + * + * Computation is initiated by signaling vertices using either + * \ref graphlab::powerlyra_sync_engine::signal or + * \ref graphlab::powerlyra_sync_engine::signal_all. In either case all + * machines should invoke signal or signal all at the same time. Finally, + * computation is initiated by calling the + * \ref graphlab::powerlyra_sync_engine::start function. + * + * ### Example Usage + * + * The following is a simple example demonstrating how to use the engine: + * \code + * #include + * + * struct vertex_data { + * // code + * }; + * struct edge_data { + * // code + * }; + * typedef graphlab::distributed_graph graph_type; + * typedef float gather_type; + * struct pagerank_vprog : + * public graphlab::ivertex_program { + * // code + * }; + * + * int main(int argc, char** argv) { + * // Initialize control plain using mpi + * graphlab::mpi_tools::init(argc, argv); + * graphlab::distributed_control dc; + * // Parse command line options + * graphlab::command_line_options clopts("PageRank algorithm."); + * std::string graph_dir; + * clopts.attach_option("graph", &graph_dir, graph_dir, + * "The graph file."); + * if(!clopts.parse(argc, argv)) { + * std::cout << "Error in parsing arguments." 
<< std::endl; + * return EXIT_FAILURE; + * } + * graph_type graph(dc, clopts); + * graph.load_structure(graph_dir, "tsv"); + * graph.finalize(); + * std::cout << "#vertices: " << graph.num_vertices() + * << " #edges:" << graph.num_edges() << std::endl; + * graphlab::powerlyra_sync_engine engine(dc, graph, clopts); + * engine.signal_all(); + * engine.start(); + * std::cout << "Runtime: " << engine.elapsed_time(); + * graphlab::mpi_tools::finalize(); + * } + * \endcode + * + * + * + * Engine Options + * ===================== + * The synchronous engine supports several engine options which can + * be set as command line arguments using \c --engine_opts : + * + * \li max_iterations: (default: infinity) The maximum number + * of iterations (super-steps) to run. + * + * \li timeout: (default: infinity) The maximum time in + * seconds that the engine may run. When the time runs out the + * current iteration is completed and then the engine terminates. + * + * \li use_cache: (default: false) This is used to enable + * caching. When caching is enabled the gather phase is skipped for + * vertices that already have a cached value. To use caching the + * vertex program must either clear (\ref icontext::clear_gather_cache) + * or update (\ref icontext::post_delta) the cache values of + * neighboring vertices during the scatter phase. + * + * \li \b snapshot_interval If set to a positive value, a snapshot + * is taken every this number of iterations. If set to 0, a snapshot + * is taken before the first iteration. If set to a negative value, + * no snapshots are taken. Defaults to -1. A snapshot is a binary + * dump of the graph. + * + * \li \b snapshot_path If snapshot_interval is set to a value >=0, + * this option must be specified and should contain a target basename + * for the snapshot. The path including folder and file prefix in + * which the snapshots should be saved. 
+ * + * \see graphlab::omni_engine + * \see graphlab::async_consistent_engine + * \see graphlab::semi_synchronous_engine + * \see graphlab::powerlyra_sync_engine + */ + template + class powerlyra_sync_engine : + public iengine { + + public: + /** + * \brief The user defined vertex program type. Equivalent to the + * VertexProgram template argument. + * + * The user defined vertex program type which should implement the + * \ref graphlab::ivertex_program interface. + */ + typedef VertexProgram vertex_program_type; + + /** + * \brief The user defined type returned by the gather function. + * + * The gather type is defined in the \ref graphlab::ivertex_program + * interface and is the value returned by the + * \ref graphlab::ivertex_program::gather function. The + * gather type must have an operator+=(const gather_type& + * other) function and must be \ref sec_serializable. + */ + typedef typename VertexProgram::gather_type gather_type; + + + /** + * \brief The user defined message type used to signal neighboring + * vertex programs. + * + * The message type is defined in the \ref graphlab::ivertex_program + * interface and used in the call to \ref graphlab::icontext::signal. + * The message type must have an + * operator+=(const gather_type& other) function and + * must be \ref sec_serializable. + */ + typedef typename VertexProgram::message_type message_type; + + /** + * \brief The type of data associated with each vertex in the graph + * + * The vertex data type must be \ref sec_serializable. + */ + typedef typename VertexProgram::vertex_data_type vertex_data_type; + + /** + * \brief The type of data associated with each edge in the graph + * + * The edge data type must be \ref sec_serializable. 
+ */ + typedef typename VertexProgram::edge_data_type edge_data_type; + + /** + * \brief The type of graph supported by this vertex program + * + * See graphlab::distributed_graph + */ + typedef typename VertexProgram::graph_type graph_type; + + /** + * \brief The type used to represent a vertex in the graph. + * See \ref graphlab::distributed_graph::vertex_type for details + * + * The vertex type contains the function + * \ref graphlab::distributed_graph::vertex_type::data which + * returns a reference to the vertex data as well as other functions + * like \ref graphlab::distributed_graph::vertex_type::num_in_edges + * which returns the number of in edges. + * + */ + typedef typename graph_type::vertex_type vertex_type; + + /** + * \brief The type used to represent an edge in the graph. + * See \ref graphlab::distributed_graph::edge_type for details. + * + * The edge type contains the function + * \ref graphlab::distributed_graph::edge_type::data which returns a + * reference to the edge data. In addition the edge type contains + * the function \ref graphlab::distributed_graph::edge_type::source and + * \ref graphlab::distributed_graph::edge_type::target. + * + */ + typedef typename graph_type::edge_type edge_type; + + /** + * \brief The type of the callback interface passed by the engine to vertex + * programs. See \ref graphlab::icontext for details. + * + * The context callback is passed to the vertex program functions and is + * used to signal other vertices, get the current iteration, and access + * information about the engine. 
+ */ + typedef icontext icontext_type; + + private: + + /** + * \brief Local vertex type used by the engine for fast indexing + */ + typedef typename graph_type::local_vertex_type local_vertex_type; + + /** + * \brief Local edge type used by the engine for fast indexing + */ + typedef typename graph_type::local_edge_type local_edge_type; + + /** + * \brief Local vertex id type used by the engine for fast indexing + */ + typedef typename graph_type::lvid_type lvid_type; + + std::vector per_thread_compute_time; + /** + * \brief The actual instance of the context type used by this engine. + */ + typedef context context_type; + friend class context; + + + /** + * \brief The type of the distributed aggregator inherited from iengine + */ + typedef typename iengine::aggregator_type aggregator_type; + + /** + * \brief The object used to communicate with remote copies of the + * synchronous engine. + */ + dc_dist_object< powerlyra_sync_engine > rmi; + + /** + * \brief A reference to the distributed graph on which this + * synchronous engine is running. + */ + graph_type& graph; + + /** + * \brief The number of CPUs used. + */ + size_t ncpus; + + /** + * \brief The local worker threads used by this engine + */ + fiber_group threads; + + /** + * \brief A thread barrier that is used to control the threads in the + * thread pool. + */ + fiber_barrier thread_barrier; + + /** + * \brief The maximum number of super-steps (iterations) to run + * before terminating. If the max iterations is reached the + * engine will terminate if their are no messages remaining. + */ + size_t max_iterations; + + + /* + * \brief When caching is enabled the gather phase is skipped for + * vertices that already have a cached value. To use caching the + * vertex program must either clear (\ref icontext::clear_gather_cache) + * or update (\ref icontext::post_delta) the cache values of + * neighboring vertices during the scatter phase. 
+ */ + bool use_cache; + + /** + * \brief A snapshot is taken every this number of iterations. + * If snapshot_interval == 0, a snapshot is only taken before the first + * iteration. If snapshot_interval < 0, no snapshots are taken. + */ + int snapshot_interval; + + /// \brief The target base name the snapshot is saved in. + std::string snapshot_path; + + /** + * \brief A counter that tracks the current iteration number since + * start was last invoked. + */ + size_t iteration_counter; + + /** + * \brief The time in seconds at which the engine started. + */ + float start_time; + + /** + * \brief The total execution time. + */ + double exec_time; + + /** + * \brief The time spends on exch-msgs phase. + */ + double exch_time; + + /** + * \brief The time spends on recv-msgs phase. + */ + double recv_time; + + /** + * \brief The time spends on gather phase. + */ + double gather_time; + + /** + * \brief The time spends on apply phase. + */ + double apply_time; + + /** + * \brief The time spends on scatter phase. + */ + double scatter_time; + + /** + * \brief The interval time to print status. + */ + float print_interval; + + /** + * \brief The timeout time in seconds + */ + float timeout; + + /** + * \brief Schedules all vertices every iteration + */ + bool sched_allv; + + /** + * \brief Used to stop the engine prematurely + */ + bool force_abort; + + /** + * \brief The vertex locks protect access to vertex specific + * data-structures including + * \ref graphlab::powerlyra_sync_engine::gather_accum + * and \ref graphlab::powerlyra_sync_engine::messages. + */ + std::vector vlocks; + + /** + * \brief The elocks protect individual edges during gather and + * scatter. Technically there is a potential race since gather + * and scatter can modify edge values and can overlap. The edge + * lock ensures that only one gather or scatter occurs on an edge + * at a time. 
+ */ + std::vector elocks; + + /** + * \brief The vertex programs associated with each vertex on this + * machine. + */ + std::vector vertex_programs; + + /** + * \brief Vector of messages associated with each vertex. + */ + std::vector messages; + + /** + * \brief Bit indicating whether a message is present for each vertex. + */ + dense_bitset has_message; + + + /** + * \brief Gather accumulator used for each master vertex to merge + * the result of all the machine specific accumulators (or + * caches). + * + * The gather accumulator can be accessed by multiple threads at + * once and therefore must be guarded by a vertex locks in + * \ref graphlab::powerlyra_sync_engine::vlocks + */ + std::vector gather_accum; + + /** + * \brief Bit indicating if the gather has accumulator contains any + * values. + * + * While dense bitsets are thread safe the value of this bit must + * change concurrently with the + * \ref graphlab::powerlyra_sync_engine::gather_accum and therefore is + * set while holding the lock in + * \ref graphlab::powerlyra_sync_engine::vlocks. + */ + dense_bitset has_gather_accum; + + + /** + * \brief This optional vector contains caches of previous gather + * contributions for each machine. + * + * Caching is done locally and therefore a high-degree vertex may + * have multiple caches (one per machine). + */ + std::vector gather_cache; + + /** + * \brief A bit indicating if the local gather for that vertex is + * available. + */ + dense_bitset has_cache; + + /** + * \brief A bit (for master vertices) indicating if that vertex is active + * (received a message on this iteration). + */ + dense_bitset active_superstep; + + /** + * \brief The number of local vertices (masters) that are active on this + * iteration. + */ + atomic num_active_vertices; + + /** + * \brief A bit indicating (for all vertices) whether to + * participate in the current minor-step (gather or scatter). 
+ */ + dense_bitset active_minorstep; + + /** + * \brief A counter measuring the number of gathers that have been completed + */ + atomic completed_gathers; + + /** + * \brief A counter measuring the number of applys that have been completed + */ + atomic completed_applys; + + /** + * \brief A counter measuring the number of scatters that have been completed + */ + atomic completed_scatters; + + + /** + * \brief The shared counter used coordinate operations between + * threads. + */ + atomic shared_lvid_counter; + + /** + * \brief The engine type used to create express. + */ + typedef powerlyra_sync_engine engine_type; + + /** + * \brief The pair type used to synchronize vertex programs across machines. + */ + typedef std::pair vid_vprog_pair_type; + + /** + * \brief The type of the express used to activate mirrors + */ + typedef fiber_buffered_exchange + activ_exchange_type; + + /** + * \brief The type of buffer used by the express to activate mirrors + */ + typedef typename activ_exchange_type::buffer_type activ_buffer_type; + + /** + * \brief The distributed express used to activate mirrors + * vertex programs. + */ + activ_exchange_type activ_exchange; + + /** + * \brief The triple type used to update vertex data and activate neighbors. + */ + typedef triple + vid_vdata_vprog_triple_type; + + /** + * \brief The type of the exchange used to update mirrors + */ + typedef fiber_buffered_exchange + update_activ_exchange_type; + + /** + * \brief The type of buffer used by the exchange to update mirrors + */ + typedef typename update_activ_exchange_type::buffer_type + update_activ_buffer_type; + + /** + * \brief The distributed express used to update mirrors + * vertex programs. + */ + update_activ_exchange_type update_activ_exchange; + + + /** + * \brief The triple type used to only update vertex data. 
+ */ + typedef std::pair vid_vdata_pair_type; + + /** + * \brief The type of the express used to update mirrors + */ + typedef fiber_buffered_exchange update_exchange_type; + + /** + * \brief The type of buffer used by the exchange to update mirrors + */ + typedef typename update_exchange_type::buffer_type update_buffer_type; + + /** + * \brief The distributed express used to update mirrors + * vertex programs. + */ + update_exchange_type update_exchange; + + + /** + * \brief The pair type used to synchronize the results of the gather phase + */ + typedef std::pair vid_gather_pair_type; + + /** + * \brief The type of the exchange used to synchronize accums + * accumulators + */ + typedef fiber_buffered_exchange accum_exchange_type; + + /** + * \brief The distributed exchange used to synchronize accums + * accumulators. + */ + accum_exchange_type accum_exchange; + + /** + * \brief The pair type used to synchronize messages + */ + typedef std::pair vid_message_pair_type; + + /** + * \brief The type of the exchange used to synchronize messages + */ + typedef fiber_buffered_exchange message_exchange_type; + + /** + * \brief The distributed exchange used to synchronize messages + */ + message_exchange_type message_exchange; + + + /** + * \brief The distributed aggregator used to manage background + * aggregation. + */ + aggregator_type aggregator; + + DECLARE_EVENT(EVENT_APPLIES); + DECLARE_EVENT(EVENT_GATHERS); + DECLARE_EVENT(EVENT_SCATTERS); + DECLARE_EVENT(EVENT_ACTIVE_CPUS); + public: + + /** + * \brief Construct a synchronous engine for a given graph and options. + * + * The synchronous engine should be constructed after the graph + * has been loaded (e.g., \ref graphlab::distributed_graph::load) + * and the graphlab options have been set + * (e.g., \ref graphlab::command_line_options). 
+ * + * In the distributed engine the synchronous engine must be called + * on all machines at the same time (in the same order) passing + * the \ref graphlab::distributed_control object. Upon + * construction the synchronous engine allocates several + * data-structures to store messages, gather accumulators, and + * vertex programs and therefore may require considerable memory. + * + * The number of threads to create is read from + * \ref graphlab_options::get_ncpus "opts.get_ncpus()". + * + * See the main class documentation + * for details on the available options. + * + * @param [in] dc Distributed controller to associate with + * @param [in,out] graph A reference to the graph object that this + * engine will modify. The graph must be fully constructed and + * finalized. + * @param [in] opts A graphlab::graphlab_options object specifying engine + * parameters. This is typically constructed using + * \ref graphlab::command_line_options. + */ + powerlyra_sync_engine(distributed_control& dc, graph_type& graph, + const graphlab_options& opts = graphlab_options()); + + + /** + * \brief Start execution of the synchronous engine. + * + * The start function begins computation and does not return until + * there are no remaining messages or until max_iterations has + * been reached. + * + * The start() function modifies the data graph through the vertex + * programs and so upon return the data graph should contain the + * result of the computation. 
+ * + * @return The reason for termination + */ + execution_status::status_enum start(); + + // documentation inherited from iengine + size_t num_updates() const; + + // documentation inherited from iengine + void signal(vertex_id_type vid, + const message_type& message = message_type()); + + // documentation inherited from iengine + void signal_all(const message_type& message = message_type(), + const std::string& order = "shuffle"); + + void signal_vset(const vertex_set& vset, + const message_type& message = message_type(), + const std::string& order = "shuffle"); + + + // documentation inherited from iengine + float elapsed_seconds() const; + + /** + * \brief Get the current iteration number since start was last + * invoked. + * + * \return the current iteration + */ + int iteration() const; + + + /** + * \brief Compute the total memory used by the entire distributed + * system. + * + * @return The total memory used in bytes. + */ + size_t total_memory_usage() const; + + /** + * \brief Get a pointer to the distributed aggregator object. + * + * This is currently used by the \ref graphlab::iengine interface to + * implement the calls to aggregation. + * + * @return a pointer to the local aggregator. + */ + aggregator_type* get_aggregator(); + + /** + * \brief Initialize the engine and allocate datastructures for vertex, and lock, + * clear all the messages. + */ + void init(); + + + private: + + + /** + * \brief Resize the datastructures to fit the graph size (in case of dynamic graph). Keep all the messages + * and caches. + */ + void resize(); + + /** + * \brief This internal stop function is called by the \ref graphlab::context to + * terminate execution of the engine. + */ + void internal_stop(); + + /** + * \brief This function is called remote by the rpc to force the + * engine to stop. + */ + void rpc_stop(); + + /** + * \brief Signal a vertex. + * + * This function is called by the \ref graphlab::context. 
+ * + * @param [in] vertex the vertex to signal + * @param [in] message the message to send to that vertex. + */ + void internal_signal(const vertex_type& vertex, + const message_type& message); + + void internal_signal(const vertex_type& vertex); + + /** + * \brief Called by the context to signal an arbitrary vertex. + * This must be done by finding the owner of that vertex. + * + * @param [in] gvid the global vertex id of the vertex to signal + * @param [in] message the message to send to that vertex. + */ + void internal_signal_gvid(vertex_id_type gvid, + const message_type& message = message_type()); + + /** + * \brief This function tests if this machine is the master of + * gvid and signals if successful. + */ + void internal_signal_rpc(vertex_id_type gvid, + const message_type& message = message_type()); + + + /** + * \brief Post a delta to a previous gather for a given vertex. + * + * This function is called by the \ref graphlab::context. + * + * @param [in] vertex The vertex to which to post a change in the sum + * @param [in] delta The change in that sum + */ + void internal_post_delta(const vertex_type& vertex, + const gather_type& delta); + + /** + * \brief Clear the cached gather for a vertex if one is + * available. + * + * This function is called by the \ref graphlab::context. + * + * @param [in] vertex the vertex for which to clear the cache + */ + void internal_clear_gather_cache(const vertex_type& vertex); + + + // Program Steps ========================================================== + + + void thread_launch_wrapped_event_counter(boost::function fn) { + INCREMENT_EVENT(EVENT_ACTIVE_CPUS, 1); + fn(); + DECREMENT_EVENT(EVENT_ACTIVE_CPUS, 1); + } + + /** + * \brief Executes ncpus copies of a member function each with a + * unique consecutive id (thread id). + * + * This function is used by the main loop to execute each of the + * stages in parallel. 
+ * + * The member function must have the type: + * + * \code + * void powerlyra_sync_engine::member_fun(size_t threadid); + * \endcode + * + * This function runs an rmi barrier after termination + * + * @tparam the type of the member function. + * @param [in] member_fun the function to call. + */ + template + void run_synchronous(MemberFunction member_fun) { + shared_lvid_counter = 0; + if (ncpus <= 1) { + INCREMENT_EVENT(EVENT_ACTIVE_CPUS, 1); + } + // launch the initialization threads + for(size_t i = 0; i < ncpus; ++i) { + fiber_control::affinity_type affinity; + affinity.clear(); affinity.set_bit(i); + boost::function invoke = boost::bind(member_fun, this, i); + threads.launch(boost::bind( + &powerlyra_sync_engine::thread_launch_wrapped_event_counter, + this, + invoke), affinity); + } + // Wait for all threads to finish + threads.join(); + rmi.barrier(); + if (ncpus <= 1) { + DECREMENT_EVENT(EVENT_ACTIVE_CPUS, 1); + } + } // end of run_synchronous + + inline bool high_lvid(const lvid_type lvid); + inline bool low_lvid(const lvid_type lvid); + + // /** + // * \brief Initialize all vertex programs by invoking + // * \ref graphlab::ivertex_program::init on all vertices. + // * + // * @param thread_id the thread to run this as which determines + // * which vertices to process. + // */ + // void initialize_vertex_programs(size_t thread_id); + + /** + * \brief Synchronize all message data. + * + * @param thread_id the thread to run this as which determines + * which vertices to process. + */ + void exchange_messages(size_t thread_id); + + + /** + * \brief Invoke the \ref graphlab::ivertex_program::init function + * on all vertex programs that have inbound messages. + * + * @param thread_id the thread to run this as which determines + * which vertices to process. 
+ */ + void receive_messages(size_t thread_id); + + + /** + * \brief Execute the \ref graphlab::ivertex_program::gather function on all + * vertices that received messages for the edges specified by the + * \ref graphlab::ivertex_program::gather_edges. + * + * @param thread_id the thread to run this as which determines + * which vertices to process. + */ + void execute_gathers(size_t thread_id); + + + + + /** + * \brief Execute the \ref graphlab::ivertex_program::apply function on + * all vertices that received messages in this super-step (active). + * + * @param thread_id the thread to run this as which determines + * which vertices to process. + */ + void execute_applys(size_t thread_id); + + /** + * \brief Execute the \ref graphlab::ivertex_program::scatter function on all + * vertices that received messages for the edges specified by the + * \ref graphlab::ivertex_program::scatter_edges. + * + * @param thread_id the thread to run this as which determines + * which vertices to process. + */ + void execute_scatters(size_t thread_id); + + // Data Synchronization =================================================== + /** + * \brief Send the activation messages (vertex program and edge set) + * for the local vertex id to all of its mirrors. + * + * @param [in] lvid the vertex to sync. It must be the master of that vertex. + */ + void send_activs(lvid_type lvid, size_t thread_id); + + /** + * \brief Perform activation on local mirrors. + * + * This function is a callback of express, and will be invoked when an + * activation message is received. + */ + void recv_activs(); + + /** + * \brief Send the update messages (vertex data, program) + * for the local vertex id to all of its mirrors. + * + * @param [in] lvid the vertex to sync. It must be the master of that vertex. + */ + void send_updates_activs(lvid_type lvid, size_t thread_id); + + /** + * \brief Perform update and activation on local mirrors. 
+ * + * This function returns when there is nothing left in the + * buffered exchange and should be called after the buffered + * exchange has been flushed + */ + void recv_updates_activs(); + + + /** + * \brief Send the update messages (vertex data, program and edge set) + * for the local vertex id to all of its mirrors. + * + * @param [in] lvid the vertex to sync. It must be the master of that vertex. + */ + void send_updates(lvid_type lvid, size_t thread_id); + + /** + * \brief Perform update on local mirrors. + * + * This function returns when there is nothing left in the + * buffered exchange and should be called after the buffered + * exchange has been flushed + */ + void recv_updates(); + + /** + * \brief Send the gather accum for the vertex id to its master. + * + * @param [in] lvid the vertex to send the gather value to + * @param [in] accum the locally computed gather value. + */ + void send_accum(lvid_type lvid, const gather_type& accum, + const size_t thread_id); + + + /** + * \brief Receive the gather accums from the buffered exchange. + * + * This function returns when there is nothing left in the + * buffered exchange and should be called after the buffered + * exchange has been flushed + */ + void recv_accums(); + + /** + * \brief Send the scatter messages for the vertex id to its master. + * + * @param [in] lvid the vertex to send + */ + void send_message(lvid_type lvid, const size_t thread_id); + + /** + * \brief Receive the scatter messages from the buffered exchange. + * + * This function returns when there is nothing left in the + * buffered exchange and should be called after the buffered + * exchange has been flushed + */ + void recv_messages(); + + + }; // end of class powerlyra_sync_engine + + + + + + + + + + + + + + + + + + + + + + + + + + /** + * Constructs a synchronous distributed engine. + * The number of threads to create is read from + * opts::get_ncpus(). 
+ * + * Valid engine options (graphlab_options::get_engine_args()): + * \arg \c max_iterations Sets the maximum number of iterations the + * engine will run for. + * \arg \c use_cache If set to true, partial gathers are cached. + * See \ref gather_caching to understand the behavior of the + * gather caching model and how it may be used to accelerate program + * performance. + * + * \param dc Distributed controller to associate with + * \param graph The graph to schedule over. The graph must be fully + * constructed and finalized. + * \param opts A graphlab_options object containing options and parameters + * for the engine. + */ + template + powerlyra_sync_engine:: + powerlyra_sync_engine(distributed_control& dc, + graph_type& graph, + const graphlab_options& opts) : + rmi(dc, this), graph(graph), + ncpus(opts.get_ncpus()), + threads(2*1024*1024 /* 2MB stack per fiber*/), + thread_barrier(opts.get_ncpus()), + max_iterations(-1), snapshot_interval(-1), iteration_counter(0), + print_interval(5), timeout(0), sched_allv(false), + activ_exchange(dc), + update_activ_exchange(dc), + update_exchange(dc), + accum_exchange(dc), + message_exchange(dc), + aggregator(dc, graph, new context_type(*this, graph)) { + // Process any additional options + std::vector keys = opts.get_engine_args().get_option_keys(); + per_thread_compute_time.resize(opts.get_ncpus()); + use_cache = false; + foreach(std::string opt, keys) { + if (opt == "max_iterations") { + opts.get_engine_args().get_option("max_iterations", max_iterations); + if (rmi.procid() == 0) + logstream(LOG_EMPH) << "Engine Option: max_iterations = " + << max_iterations << std::endl; + } else if (opt == "timeout") { + opts.get_engine_args().get_option("timeout", timeout); + if (rmi.procid() == 0) + logstream(LOG_EMPH) << "Engine Option: timeout = " + << timeout << std::endl; + } else if (opt == "use_cache") { + opts.get_engine_args().get_option("use_cache", use_cache); + if (rmi.procid() == 0) + logstream(LOG_EMPH) << "Engine 
Option: use_cache = " + << use_cache << std::endl; + } else if (opt == "snapshot_interval") { + opts.get_engine_args().get_option("snapshot_interval", snapshot_interval); + if (rmi.procid() == 0) + logstream(LOG_EMPH) << "Engine Option: snapshot_interval = " + << snapshot_interval << std::endl; + } else if (opt == "snapshot_path") { + opts.get_engine_args().get_option("snapshot_path", snapshot_path); + if (rmi.procid() == 0) + logstream(LOG_EMPH) << "Engine Option: snapshot_path = " + << snapshot_path << std::endl; + } else if (opt == "sched_allv") { + opts.get_engine_args().get_option("sched_allv", sched_allv); + if (rmi.procid() == 0) + logstream(LOG_EMPH) << "Engine Option: sched_allv = " + << sched_allv << std::endl; + } else { + logstream(LOG_FATAL) << "Unexpected Engine Option: " << opt << std::endl; + } + } + + if (snapshot_interval >= 0 && snapshot_path.length() == 0) { + logstream(LOG_FATAL) + << "Snapshot interval specified, but no snapshot path" << std::endl; + } + INITIALIZE_EVENT_LOG(dc); + ADD_CUMULATIVE_EVENT(EVENT_APPLIES, "Applies", "Calls"); + ADD_CUMULATIVE_EVENT(EVENT_GATHERS , "Gathers", "Calls"); + ADD_CUMULATIVE_EVENT(EVENT_SCATTERS , "Scatters", "Calls"); + ADD_INSTANTANEOUS_EVENT(EVENT_ACTIVE_CPUS, "Active Threads", "Threads"); + graph.finalize(); + init(); + } // end of powerlyra_sync_engine + + + template + void powerlyra_sync_engine:: init() { + memory_info::log_usage("Before Engine Initialization"); + + resize(); + + // Clear up + force_abort = false; + iteration_counter = 0; + completed_gathers = 0; + completed_applys = 0; + completed_scatters = 0; + has_message.clear(); + has_gather_accum.clear(); + has_cache.clear(); + active_superstep.clear(); + active_minorstep.clear(); + + memory_info::log_usage("After Engine Initialization"); + } + + + template + void powerlyra_sync_engine:: resize() { + size_t l_nverts = graph.num_local_vertices(); + + // Allocate vertex locks and vertex programs + vlocks.resize(l_nverts); + 
vertex_programs.resize(l_nverts); + + // Allocate messages and message bitset + messages.resize(l_nverts, message_type()); + has_message.resize(l_nverts); + + // Allocate gather accumulators and accumulator bitset + gather_accum.resize(l_nverts, gather_type()); + has_gather_accum.resize(l_nverts); + + // If caching is used then allocate cache data-structures + if (use_cache) { + gather_cache.resize(l_nverts, gather_type()); + has_cache.resize(l_nverts); + } + // Allocate bitset to track active vertices on each bitset. + active_superstep.resize(l_nverts); + active_minorstep.resize(l_nverts); + } + + + template + typename powerlyra_sync_engine::aggregator_type* + powerlyra_sync_engine::get_aggregator() { + return &aggregator; + } // end of get_aggregator + + + template + void powerlyra_sync_engine::internal_stop() { + for (size_t i = 0; i < rmi.numprocs(); ++i) + rmi.remote_call(i, &engine_type::rpc_stop); + } // end of internal_stop + + + template + void powerlyra_sync_engine::rpc_stop() { + force_abort = true; + } // end of rpc_stop + + + template + void powerlyra_sync_engine:: + signal(vertex_id_type gvid, const message_type& message) { + if (vlocks.size() != graph.num_local_vertices()) + resize(); + rmi.barrier(); + internal_signal_rpc(gvid, message); + rmi.barrier(); + } // end of signal + + + + template + void powerlyra_sync_engine:: + signal_all(const message_type& message, const std::string& order) { + if (vlocks.size() != graph.num_local_vertices()) + resize(); + for(lvid_type lvid = 0; lvid < graph.num_local_vertices(); ++lvid) { + if(graph.l_is_master(lvid)) { + internal_signal(vertex_type(graph.l_vertex(lvid)), message); + } + } + } // end of signal all + + + template + void powerlyra_sync_engine:: + signal_vset(const vertex_set& vset, + const message_type& message, const std::string& order) { + if (vlocks.size() != graph.num_local_vertices()) + resize(); + for(lvid_type lvid = 0; lvid < graph.num_local_vertices(); ++lvid) { + if(graph.l_is_master(lvid) 
&& vset.l_contains(lvid)) { + internal_signal(vertex_type(graph.l_vertex(lvid)), message); + } + } + } // end of signal all + + + template + void powerlyra_sync_engine:: + internal_signal(const vertex_type& vertex, + const message_type& message) { + const lvid_type lvid = vertex.local_id(); + vlocks[lvid].lock(); + if( has_message.get(lvid) ) { + messages[lvid] += message; + } else { + messages[lvid] = message; + has_message.set_bit(lvid); + } + vlocks[lvid].unlock(); + } // end of internal_signal + + template + void powerlyra_sync_engine:: + internal_signal(const vertex_type& vertex) { + const lvid_type lvid = vertex.local_id(); + // set an empty message + messages[lvid] = message_type(); + // atomic set is enough, without acquiring and releasing lock + has_message.set_bit(lvid); + } // end of internal_signal + + template + void powerlyra_sync_engine:: + internal_signal_gvid(vertex_id_type gvid, const message_type& message) { + procid_t proc = graph.master(gvid); + if(proc == rmi.procid()) internal_signal_rpc(gvid, message); + else rmi.remote_call(proc, + &powerlyra_sync_engine::internal_signal_rpc, + gvid, message); + } // end of internal_signal_gvid + + template + void powerlyra_sync_engine:: + internal_signal_rpc(vertex_id_type gvid, + const message_type& message) { + if (graph.is_master(gvid)) { + internal_signal(graph.vertex(gvid), message); + } + } // end of internal_signal_rpc + + + + + + template + void powerlyra_sync_engine:: + internal_post_delta(const vertex_type& vertex, const gather_type& delta) { + const bool caching_enabled = !gather_cache.empty(); + if(caching_enabled) { + const lvid_type lvid = vertex.local_id(); + vlocks[lvid].lock(); + if( has_cache.get(lvid) ) { + gather_cache[lvid] += delta; + } else { + // You cannot add a delta to an empty cache. A complete + // gather must have been run. 
+ // gather_cache[lvid] = delta; + // has_cache.set_bit(lvid); + } + vlocks[lvid].unlock(); + } + } // end of post_delta + + + template + void powerlyra_sync_engine:: + internal_clear_gather_cache(const vertex_type& vertex) { + const bool caching_enabled = !gather_cache.empty(); + const lvid_type lvid = vertex.local_id(); + if(caching_enabled && has_cache.get(lvid)) { + vlocks[lvid].lock(); + gather_cache[lvid] = gather_type(); + has_cache.clear_bit(lvid); + vlocks[lvid].unlock(); + } + } // end of clear_gather_cache + + + + + template + size_t powerlyra_sync_engine:: + num_updates() const { return completed_applys.value; } + + template + float powerlyra_sync_engine:: + elapsed_seconds() const { return timer::approx_time_seconds() - start_time; } + + template + int powerlyra_sync_engine:: + iteration() const { return iteration_counter; } + + + + template + size_t powerlyra_sync_engine::total_memory_usage() const { + size_t allocated_memory = memory_info::allocated_bytes(); + rmi.all_reduce(allocated_memory); + return allocated_memory; + } // compute the total memory usage of the GraphLab system + + + template + execution_status::status_enum powerlyra_sync_engine:: + start() { + if (vlocks.size() != graph.num_local_vertices()) + resize(); + completed_gathers = 0; + completed_applys = 0; + completed_scatters = 0; + rmi.barrier(); + + // Initialization code ================================================== + // Reset event log counters? 
+ // Start the timer + start_time = timer::approx_time_seconds(); +#ifdef TUNING + exec_time = exch_time = recv_time = + gather_time = apply_time = scatter_time = 0.0; + graphlab::timer ti, bk_ti; +#endif + iteration_counter = 0; + force_abort = false; + execution_status::status_enum termination_reason = execution_status::UNSET; + aggregator.start(); + rmi.barrier(); + + if (snapshot_interval == 0) { + graph.save_binary(snapshot_path); + } + + float last_print = -print_interval; // print the first iteration + if (rmi.procid() == 0) { + logstream(LOG_EMPH) << "Iteration counter will only output every " + << print_interval << " seconds." + << std::endl; + } + + // Program Main loop ==================================================== +#ifdef TUNING + ti.start(); +#endif + while(iteration_counter < max_iterations && !force_abort ) { + + // Check first to see if we are out of time + if(timeout != 0 && timeout < elapsed_seconds()) { + termination_reason = execution_status::TIMEOUT; + break; + } + + bool print_this_round = (elapsed_seconds() - last_print) >= print_interval; + if(rmi.procid() == 0 && print_this_round) { + logstream(LOG_DEBUG) + << rmi.procid() << ": Starting iteration: " << iteration_counter + << std::endl; + last_print = elapsed_seconds(); + } + // Reset Active vertices ---------------------------------------------- + // Clear the active super-step and minor-step bits which will + // be set upon receiving messages + active_superstep.clear(); active_minorstep.clear(); + has_gather_accum.clear(); + num_active_vertices = 0; + rmi.barrier(); + + + // Exchange Messages -------------------------------------------------- + // High: send messages from mirrors to master + // Low: none (if only IN_EDGES) + // + // if (rmi.procid() == 0) std::cout << "Exchange messages..." 
<< std::endl; +#ifdef TUNING + bk_ti.start(); +#endif + run_synchronous( &powerlyra_sync_engine::exchange_messages ); +#ifdef TUNING + exch_time += bk_ti.current_time(); +#endif + /** + * Post conditions: + * 1) master (high and low) vertices have messages + */ + + // Receive Messages --------------------------------------------------- + // 1. calculate the number of active vertices + // 2. call init and gather_edges + // 3. set active_superstep, active_minorstep and edge_dirs + // 4. clear has_message + // + // High: send vprog and edge_dirs from master to mirrors + // Low: none (if only IN_EDGES) + // + // if (rmi.procid() == 0) std::cout << "Receive messages..." << std::endl; +#ifdef TUNING + bk_ti.start(); +#endif + run_synchronous( &powerlyra_sync_engine::receive_messages ); + if (sched_allv) active_minorstep.fill(); + has_message.clear(); +#ifdef TUNING + recv_time += bk_ti.current_time(); +#endif + /** + * Post conditions: + * 1) there are no messages remaining + * 2) All masters that received messages have their + * active_superstep bit set + * 3) All masters and mirrors that are to participate in the + * next gather phases have their active_minorstep bit + * set. + * 4) num_active_vertices is the number of vertices that + * received messages. + */ + + // Check termination condition --------------------------------------- + size_t total_active_vertices = num_active_vertices; + rmi.all_reduce(total_active_vertices); + if (rmi.procid() == 0 && print_this_round) + logstream(LOG_EMPH) + << "\tActive vertices: " << total_active_vertices << std::endl; + if(total_active_vertices == 0 ) { + termination_reason = execution_status::TASK_DEPLETION; + break; + } + + + // Execute gather operations------------------------------------------- + // 1. call pre_local_gather, gather and post_local_gather + // 2. (master) set gather_accum and has_gather_accum + // 3. 
clear active_minorstep + // + // High: send gather_accum from mirrors to master + // Low: none (if only IN_EDGES) + // + // if (rmi.procid() == 0) std::cout << "Gathering..." << std::endl; +#ifdef TUNING + bk_ti.start(); +#endif + run_synchronous( &powerlyra_sync_engine::execute_gathers ); + // Clear the minor step bit since only super-step vertices + // (only master vertices are required to participate in the + // apply step) + active_minorstep.clear(); +#ifdef TUNING + gather_time += bk_ti.current_time(); +#endif + /** + * Post conditions: + * 1) gather_accum for all master vertices contains the + * result of all the gathers (even if they are drawn from + * cache) + * 2) No minor-step bits are set + */ + + // Execute Apply Operations ------------------------------------------- + // 1. call apply and scatter_edges + // 2. set edge_dirs and active_minorstep + // 3. send vdata, vprog and edge_dirs from master to replicas + // + // if (rmi.procid() == 0) std::cout << "Applying..." << std::endl; +#ifdef TUNING + bk_ti.start(); +#endif + run_synchronous( &powerlyra_sync_engine::execute_applys ); +#ifdef TUNING + apply_time += bk_ti.current_time(); +#endif + /** + * Post conditions: + * 1) any changes to the vertex data have been synchronized + * with all mirrors. + * 2) all gather accumulators have been cleared + * 3) If a vertex program is participating in the scatter + * phase its minor-step bit has been set to active (both + * masters and mirrors) and the vertex program has been + * synchronized with the mirrors. + */ + + + // Execute Scatter Operations ----------------------------------------- + // 1. call scatter (signal: set messages and has_message) + // + // if (rmi.procid() == 0) std::cout << "Scattering..." 
<< std::endl; +#ifdef TUNING + bk_ti.start(); +#endif + run_synchronous( &powerlyra_sync_engine::execute_scatters ); +#ifdef TUNING + scatter_time += bk_ti.current_time(); +#endif + /** + * Post conditions: + * 1) NONE + */ + if(rmi.procid() == 0 && print_this_round) + logstream(LOG_EMPH) << "\t Running Aggregators" << std::endl; + // probe the aggregator + aggregator.tick_synchronous(); + + ++iteration_counter; + + if (snapshot_interval > 0 && iteration_counter % snapshot_interval == 0) { + graph.save_binary(snapshot_path); + } + } +#ifdef TUNING + exec_time = ti.current_time(); +#endif + + if (rmi.procid() == 0) { + logstream(LOG_EMPH) << iteration_counter + << " iterations completed." << std::endl; + } + // Final barrier to ensure that all engines terminate at the same time + double total_compute_time = 0; + for (size_t i = 0;i < per_thread_compute_time.size(); ++i) { + total_compute_time += per_thread_compute_time[i]; + } + std::vector all_compute_time_vec(rmi.numprocs()); + all_compute_time_vec[rmi.procid()] = total_compute_time; + rmi.all_gather(all_compute_time_vec); + + /*logstream(LOG_INFO) << "Local Calls(G|A|S): " + << completed_gathers.value << "|" + << completed_applys.value << "|" + << completed_scatters.value + << std::endl;*/ + + size_t global_completed = completed_applys; + rmi.all_reduce(global_completed); + completed_applys = global_completed; + rmi.cout() << "Updates: " << completed_applys.value << "\n"; + +#ifdef TUNING + global_completed = completed_gathers; + rmi.all_reduce(global_completed); + completed_gathers = global_completed; + + global_completed = completed_scatters; + rmi.all_reduce(global_completed); + completed_scatters = global_completed; +#endif + + if (rmi.procid() == 0) { + logstream(LOG_INFO) << "Compute Balance: "; + for (size_t i = 0;i < all_compute_time_vec.size(); ++i) { + logstream(LOG_INFO) << all_compute_time_vec[i] << " "; + } + logstream(LOG_INFO) << std::endl; +#ifdef TUNING + logstream(LOG_INFO) << "Total 
Calls(G|A|S): " + << completed_gathers.value << "|" + << completed_applys.value << "|" + << completed_scatters.value + << std::endl; + logstream(LOG_INFO) << std::endl; + logstream(LOG_EMPH) << " Execution Time: " << exec_time << std::endl; + logstream(LOG_EMPH) << "Breakdown(X|R|G|A|S): " + << exch_time << "|" + << recv_time << "|" + << gather_time << "|" + << apply_time << "|" + << scatter_time + << std::endl; +#endif + } + + rmi.full_barrier(); + // Stop the aggregator + aggregator.stop(); + // return the final reason for termination + return termination_reason; + } // end of start + + template + inline bool powerlyra_sync_engine:: + high_lvid(const lvid_type lvid) { + return graph.l_degree_type(lvid) == graph_type::HIGH; + } + + template + inline bool powerlyra_sync_engine:: + low_lvid(const lvid_type lvid) { + return graph.l_degree_type(lvid) == graph_type::LOW; + } + + template + void powerlyra_sync_engine:: + exchange_messages(const size_t thread_id) { + context_type context(*this, graph); + fixed_dense_bitset<8 * sizeof(size_t)> local_bitset; // a word-size = 64 bit + const size_t TRY_RECV_MOD = 100; + size_t vcount = 1; // avoid unnecessarily call recv_messages() + + while (1) { + // increment by a word at a time + lvid_type lvid_block_start = + shared_lvid_counter.inc_ret_last(8 * sizeof(size_t)); + if (lvid_block_start >= graph.num_local_vertices()) break; + // get the bit field from has_message + size_t lvid_bit_block = has_message.containing_word(lvid_block_start); + if (lvid_bit_block == 0) continue; + // initialize a word sized bitfield + local_bitset.clear(); + local_bitset.initialize_from_mem(&lvid_bit_block, sizeof(size_t)); + foreach(size_t lvid_block_offset, local_bitset) { + lvid_type lvid = lvid_block_start + lvid_block_offset; + if (lvid >= graph.num_local_vertices()) break; + + // [TARGET]: High/Low-degree Mirrors + if(!graph.l_is_master(lvid)) { + send_message(lvid, thread_id); + has_message.clear_bit(lvid); + // clear the message to save 
memory + messages[lvid] = message_type(); + ++vcount; + } + if(vcount % TRY_RECV_MOD == 0) recv_messages(); + } + } // end of loop over vertices to send messages + message_exchange.partial_flush(); + // Finish sending and receiving all messages + thread_barrier.wait(); + if(thread_id == 0) message_exchange.flush(); + thread_barrier.wait(); + recv_messages(); + } // end of exchange_messages + + + template + void powerlyra_sync_engine:: + receive_messages(const size_t thread_id) { + context_type context(*this, graph); + fixed_dense_bitset<8 * sizeof(size_t)> local_bitset; // a word-size = 64 bit + const size_t TRY_RECV_MOD = 100; + size_t vcount = 0; + size_t nactive_inc = 0; + + while (1) { + // increment by a word at a time + lvid_type lvid_block_start = + shared_lvid_counter.inc_ret_last(8 * sizeof(size_t)); + if (lvid_block_start >= graph.num_local_vertices()) break; + // get the bit field from has_message + size_t lvid_bit_block = has_message.containing_word(lvid_block_start); + if (lvid_bit_block == 0) continue; + // initialize a word sized bitfield + local_bitset.clear(); + local_bitset.initialize_from_mem(&lvid_bit_block, sizeof(size_t)); + foreach(size_t lvid_block_offset, local_bitset) { + lvid_type lvid = lvid_block_start + lvid_block_offset; + if (lvid >= graph.num_local_vertices()) break; + + ASSERT_TRUE(graph.l_is_master(lvid)); + // The vertex becomes active for this superstep + active_superstep.set_bit(lvid); + ++nactive_inc; + // Pass the message to the vertex program + const vertex_type vertex(graph.l_vertex(lvid)); + vertex_programs[lvid].init(context, vertex, messages[lvid]); + // clear the message to save memory + messages[lvid] = message_type(); + if (sched_allv) continue; + // Determine if the gather should be run + const vertex_program_type& const_vprog = vertex_programs[lvid]; + edge_dir_type gather_dir = const_vprog.gather_edges(context, vertex); + if(gather_dir != graphlab::NO_EDGES) { + active_minorstep.set_bit(lvid); + // send Gx1 msgs + 
if (high_lvid(lvid) + || (low_lvid(lvid) // only if gather via out-edge + && ((gather_dir == graphlab::ALL_EDGES) + || (gather_dir == graphlab::OUT_EDGES)))) { + send_activs(lvid, thread_id); + } + } + if(++vcount % TRY_RECV_MOD == 0) recv_activs(); + } + } + num_active_vertices += nactive_inc; + activ_exchange.partial_flush(); + // Flush the buffer and finish receiving any remaining vertex + // programs. + thread_barrier.wait(); + // Flush the buffer and finish receiving any remaining activations. + if(thread_id == 0) activ_exchange.flush(); + thread_barrier.wait(); + recv_activs(); + } // end of receive_messages + + + template + void powerlyra_sync_engine:: + execute_gathers(const size_t thread_id) { + context_type context(*this, graph); + const bool caching_enabled = !gather_cache.empty(); + fixed_dense_bitset<8 * sizeof(size_t)> local_bitset; // a word-size = 64 bit + const size_t TRY_RECV_MOD = 1000; + size_t vcount = 0; + size_t ngather_inc = 0; + timer ti; + + while (1) { + // increment by a word at a time + lvid_type lvid_block_start = + shared_lvid_counter.inc_ret_last(8 * sizeof(size_t)); + if (lvid_block_start >= graph.num_local_vertices()) break; + // get the bit field from has_message + size_t lvid_bit_block = active_minorstep.containing_word(lvid_block_start); + if (lvid_bit_block == 0) continue; + // initialize a word sized bitfield + local_bitset.clear(); + local_bitset.initialize_from_mem(&lvid_bit_block, sizeof(size_t)); + foreach(size_t lvid_block_offset, local_bitset) { + lvid_type lvid = lvid_block_start + lvid_block_offset; + if (lvid >= graph.num_local_vertices()) break; + + // [TARGET]: High/Low-degree Masters, and High/Low-degree Mirrors + bool accum_is_set = false; + gather_type accum = gather_type(); + // if caching is enabled and we have a cache entry then use + // that as the accum + if (caching_enabled && has_cache.get(lvid)) { + accum = gather_cache[lvid]; + accum_is_set = true; + } else { + // recompute the local contribution to the 
gather + const vertex_program_type& vprog = vertex_programs[lvid]; + local_vertex_type local_vertex = graph.l_vertex(lvid); + const vertex_type vertex(local_vertex); + const edge_dir_type gather_dir = vprog.gather_edges(context, vertex); + + size_t edges_touched = 0; + vprog.pre_local_gather(accum); + // Loop over in edges + if (gather_dir == IN_EDGES || gather_dir == ALL_EDGES) { + foreach(local_edge_type local_edge, local_vertex.in_edges()) { + edge_type edge(local_edge); + // elocks[local_edge.id()].lock(); + if(accum_is_set) { // \todo hint likely + accum += vprog.gather(context, vertex, edge); + } else { + accum = vprog.gather(context, vertex, edge); + accum_is_set = true; + } + // elocks[local_edge.id()].unlock(); + ++edges_touched; + } + } // end of if in_edges/all_edges + // Loop over out edges + if(gather_dir == OUT_EDGES || gather_dir == ALL_EDGES) { + foreach(local_edge_type local_edge, local_vertex.out_edges()) { + edge_type edge(local_edge); + // elocks[local_edge.id()].lock(); + if(accum_is_set) { // \todo hint likely + accum += vprog.gather(context, vertex, edge); + } else { + accum = vprog.gather(context, vertex, edge); + accum_is_set = true; + } + // elocks[local_edge.id()].unlock(); + ++edges_touched; + } + } // end of if out_edges/all_edges + INCREMENT_EVENT(EVENT_GATHERS, edges_touched); + ++ngather_inc; + vprog.post_local_gather(accum); + + // If caching is enabled then save the accumulator to the + // cache for future iterations. Note that it is possible + // that the accumulator was never set in which case we are + // effectively "zeroing out" the cache. 
+ if(caching_enabled && accum_is_set) { + gather_cache[lvid] = accum; has_cache.set_bit(lvid); + } // end of if caching enabled + } + + // If the accum contains a value for the gather + if(accum_is_set) send_accum(lvid, accum, thread_id); + if(!graph.l_is_master(lvid)) { + // if this is not the master clear the vertex program + vertex_programs[lvid] = vertex_program_type(); + } + + // try to recv gathers if there are any in the buffer + if(++vcount % TRY_RECV_MOD == 0) recv_accums(); + } + } // end of loop over vertices to compute gather accumulators + completed_gathers += ngather_inc; + per_thread_compute_time[thread_id] += ti.current_time(); + accum_exchange.partial_flush(); + // Finish sending and receiving all gather operations + thread_barrier.wait(); + if(thread_id == 0) accum_exchange.flush(); + thread_barrier.wait(); + recv_accums(); + } // end of execute_gathers + + + template + void powerlyra_sync_engine:: + execute_applys(const size_t thread_id) { + context_type context(*this, graph); + fixed_dense_bitset<8 * sizeof(size_t)> local_bitset; // allocate a word size = 64bits + const size_t TRY_RECV_MOD = 1000; + size_t vcount = 0; + size_t napply_inc = 0; + timer ti; + + while (1) { + // increment by a word at a time + lvid_type lvid_block_start = + shared_lvid_counter.inc_ret_last(8 * sizeof(size_t)); + if (lvid_block_start >= graph.num_local_vertices()) break; + // get the bit field from has_message + size_t lvid_bit_block = active_superstep.containing_word(lvid_block_start); + if (lvid_bit_block == 0) continue; + // initialize a word sized bitfield + local_bitset.clear(); + local_bitset.initialize_from_mem(&lvid_bit_block, sizeof(size_t)); + foreach(size_t lvid_block_offset, local_bitset) { + lvid_type lvid = lvid_block_start + lvid_block_offset; + if (lvid >= graph.num_local_vertices()) break; + + // [TARGET]: High/Low-degree Masters + // Only master vertices can be active in a super-step + ASSERT_TRUE(graph.l_is_master(lvid)); + vertex_type 
vertex(graph.l_vertex(lvid)); + // Get the local accumulator. Note that it is possible that + // the gather_accum was not set during the gather. + const gather_type& accum = gather_accum[lvid]; + INCREMENT_EVENT(EVENT_APPLIES, 1); + vertex_programs[lvid].apply(context, vertex, accum); + // record an apply as a completed task + ++napply_inc; + // clear the accumulator to save some memory + gather_accum[lvid] = gather_type(); + // determine if a scatter operation is needed + const vertex_program_type& const_vprog = vertex_programs[lvid]; + const vertex_type const_vertex = vertex; + + if (const_vprog.scatter_edges(context, const_vertex) + != graphlab::NO_EDGES) { + // send Ax1 and Sx1 + send_updates_activs(lvid, thread_id); + active_minorstep.set_bit(lvid); + } else { + // send Ax1 + send_updates(lvid, thread_id); + vertex_programs[lvid] = vertex_program_type(); + } + + if(++vcount % TRY_RECV_MOD == 0) { + recv_updates_activs(); recv_updates(); + } + } + } // end of loop over vertices to run apply + completed_applys += napply_inc; + per_thread_compute_time[thread_id] += ti.current_time(); + update_activ_exchange.partial_flush(); update_exchange.partial_flush(); + thread_barrier.wait(); + // Flush the buffer and finish receiving any remaining updates. 
+ if(thread_id == 0) { + update_activ_exchange.flush(); update_exchange.flush(); + } + thread_barrier.wait(); + recv_updates_activs(); recv_updates(); + + } // end of execute_applys + + + template + void powerlyra_sync_engine:: + execute_scatters(const size_t thread_id) { + context_type context(*this, graph); + fixed_dense_bitset<8 * sizeof(size_t)> local_bitset; // allocate a word size = 64 bits + size_t nscatter_inc = 0; + timer ti; + + while (1) { + // increment by a word at a time + lvid_type lvid_block_start = + shared_lvid_counter.inc_ret_last(8 * sizeof(size_t)); + if (lvid_block_start >= graph.num_local_vertices()) break; + // get the bit field from has_message + size_t lvid_bit_block = active_minorstep.containing_word(lvid_block_start); + if (lvid_bit_block == 0) continue; + // initialize a word sized bitfield + local_bitset.clear(); + local_bitset.initialize_from_mem(&lvid_bit_block, sizeof(size_t)); + foreach(size_t lvid_block_offset, local_bitset) { + lvid_type lvid = lvid_block_start + lvid_block_offset; + if (lvid >= graph.num_local_vertices()) break; + + // [TARGET]: High/Low-degree Masters, and High/Low-degree Mirrors + const vertex_program_type& vprog = vertex_programs[lvid]; + local_vertex_type local_vertex = graph.l_vertex(lvid); + const vertex_type vertex(local_vertex); + const edge_dir_type scatter_dir = vprog.scatter_edges(context, vertex); + + size_t edges_touched = 0; + // Loop over in edges + if(scatter_dir == IN_EDGES || scatter_dir == ALL_EDGES) { + foreach(local_edge_type local_edge, local_vertex.in_edges()) { + edge_type edge(local_edge); + // elocks[local_edge.id()].lock(); + vprog.scatter(context, vertex, edge); + // elocks[local_edge.id()].unlock(); + ++edges_touched; + } + } // end of if in_edges/all_edges + // Loop over out edges + if(scatter_dir == OUT_EDGES || scatter_dir == ALL_EDGES) { + foreach(local_edge_type local_edge, local_vertex.out_edges()) { + edge_type edge(local_edge); + // elocks[local_edge.id()].lock(); + 
vprog.scatter(context, vertex, edge); + // elocks[local_edge.id()].unlock(); + ++edges_touched; + } + } // end of if out_edges/all_edges + INCREMENT_EVENT(EVENT_SCATTERS, edges_touched); + // Clear the vertex program + vertex_programs[lvid] = vertex_program_type(); + ++nscatter_inc; + } // end of if active on this minor step + } // end of loop over vertices to complete scatter operation + completed_scatters += nscatter_inc; + per_thread_compute_time[thread_id] += ti.current_time(); + } // end of execute_scatters + + + + // Data Synchronization =================================================== + template + inline void powerlyra_sync_engine:: + send_activs(lvid_type lvid, const size_t thread_id) { + ASSERT_TRUE(graph.l_is_master(lvid)); + const vertex_id_type vid = graph.global_vid(lvid); + local_vertex_type vertex = graph.l_vertex(lvid); + foreach(const procid_t& mirror, vertex.mirrors()) { + activ_exchange.send(mirror, + std::make_pair(vid, vertex_programs[lvid])); + } + } // end of send_activ + + template + inline void powerlyra_sync_engine:: + recv_activs() { + typename activ_exchange_type::recv_buffer_type recv_buffer; + while(activ_exchange.recv(recv_buffer)) { + for (size_t i = 0;i < recv_buffer.size(); ++i) { + typename activ_exchange_type::buffer_type& buffer = recv_buffer[i].buffer; + foreach(const vid_vprog_pair_type& pair, buffer) { + const lvid_type lvid = graph.local_vid(pair.first); + ASSERT_FALSE(graph.l_is_master(lvid)); + vertex_programs[lvid] = pair.second; + active_minorstep.set_bit(lvid); + } + } + } + } // end of recv activs programs + + template + inline void powerlyra_sync_engine:: + send_updates_activs(lvid_type lvid, const size_t thread_id) { + ASSERT_TRUE(graph.l_is_master(lvid)); + const vertex_id_type vid = graph.global_vid(lvid); + local_vertex_type vertex = graph.l_vertex(lvid); + foreach(const procid_t& mirror, vertex.mirrors()) { + update_activ_exchange.send(mirror, + make_triple(vid, + vertex.data(), + vertex_programs[lvid])); + } 
+ } // end of send_update + + template + inline void powerlyra_sync_engine:: + recv_updates_activs() { + typename update_activ_exchange_type::recv_buffer_type recv_buffer; + while(update_activ_exchange.recv(recv_buffer)) { + for (size_t i = 0;i < recv_buffer.size(); ++i) { + update_activ_buffer_type& buffer = recv_buffer[i].buffer; + foreach(const vid_vdata_vprog_triple_type& t, buffer) { + const lvid_type lvid = graph.local_vid(t.first); + ASSERT_FALSE(graph.l_is_master(lvid)); + graph.l_vertex(lvid).data() = t.second; + vertex_programs[lvid] = t.third; + active_minorstep.set_bit(lvid); + } + } + } + } // end of recv_updates + + template + inline void powerlyra_sync_engine:: + send_updates(lvid_type lvid, const size_t thread_id) { + ASSERT_TRUE(graph.l_is_master(lvid)); + const vertex_id_type vid = graph.global_vid(lvid); + local_vertex_type vertex = graph.l_vertex(lvid); + foreach(const procid_t& mirror, vertex.mirrors()) { + update_exchange.send(mirror, std::make_pair(vid, vertex.data())); + } + } // end of send_update + + template + inline void powerlyra_sync_engine:: + recv_updates() { + typename update_exchange_type::recv_buffer_type recv_buffer; + while(update_exchange.recv(recv_buffer)) { + for (size_t i = 0;i < recv_buffer.size(); ++i) { + update_buffer_type& buffer = recv_buffer[i].buffer; + foreach(const vid_vdata_pair_type& pair, buffer) { + const lvid_type lvid = graph.local_vid(pair.first); + ASSERT_FALSE(graph.l_is_master(lvid)); + graph.l_vertex(lvid).data() = pair.second; + } + } + } + } // end of recv_updates + + template + inline void powerlyra_sync_engine:: + send_accum(lvid_type lvid, const gather_type& accum, const size_t thread_id) { + if(graph.l_is_master(lvid)) { + vlocks[lvid].lock(); + if(has_gather_accum.get(lvid)) { + gather_accum[lvid] += accum; + } else { + gather_accum[lvid] = accum; + has_gather_accum.set_bit(lvid); + } + vlocks[lvid].unlock(); + } else { + const procid_t master = graph.l_master(lvid); + const vertex_id_type vid = 
graph.global_vid(lvid); + accum_exchange.send(master, std::make_pair(vid, accum)); + } + } // end of send_accum + + template + inline void powerlyra_sync_engine:: + recv_accums() { + typename accum_exchange_type::recv_buffer_type recv_buffer; + while(accum_exchange.recv(recv_buffer)) { + for (size_t i = 0; i < recv_buffer.size(); ++i) { + typename accum_exchange_type::buffer_type& buffer = recv_buffer[i].buffer; + foreach(const vid_gather_pair_type& pair, buffer) { + const lvid_type lvid = graph.local_vid(pair.first); + const gather_type& acc = pair.second; + ASSERT_TRUE(graph.l_is_master(lvid)); + vlocks[lvid].lock(); + if(has_gather_accum.get(lvid)) { + gather_accum[lvid] += acc; + } else { + gather_accum[lvid] = acc; + has_gather_accum.set_bit(lvid); + } + vlocks[lvid].unlock(); + } + } + } + } // end of recv_accums + + + template + inline void powerlyra_sync_engine:: + send_message(lvid_type lvid, const size_t thread_id) { + ASSERT_FALSE(graph.l_is_master(lvid)); + const procid_t master = graph.l_master(lvid); + const vertex_id_type vid = graph.global_vid(lvid); + message_exchange.send(master, std::make_pair(vid, messages[lvid])); + } // end of send_message + + template + inline void powerlyra_sync_engine:: + recv_messages() { + typename message_exchange_type::recv_buffer_type recv_buffer; + while(message_exchange.recv(recv_buffer)) { + for (size_t i = 0;i < recv_buffer.size(); ++i) { + typename message_exchange_type::buffer_type& buffer = recv_buffer[i].buffer; + foreach(const vid_message_pair_type& pair, buffer) { + const lvid_type lvid = graph.local_vid(pair.first); + const message_type& msg = pair.second; + ASSERT_TRUE(graph.l_is_master(lvid)); + vlocks[lvid].lock(); + if(has_message.get(lvid)) { + messages[lvid] += msg; + } else { + messages[lvid] = msg; + has_message.set_bit(lvid); + } + vlocks[lvid].unlock(); + } + } + } + } // end of recv_messages + +}; // namespace + + +#include + +#endif + diff --git a/src/graphlab/engine/synchronous_engine.hpp 
b/src/graphlab/engine/synchronous_engine.hpp index d5dc1e560c..260a27d803 100644 --- a/src/graphlab/engine/synchronous_engine.hpp +++ b/src/graphlab/engine/synchronous_engine.hpp @@ -56,6 +56,7 @@ #include +#define TUNING namespace graphlab { @@ -394,6 +395,36 @@ namespace graphlab { */ float start_time; + /** + * \brief The total execution time. + */ + double exec_time; + + /** + * \brief The time spends on exch-msgs phase. + */ + double exch_time; + + /** + * \brief The time spends on recv-msgs phase. + */ + double recv_time; + + /** + * \brief The time spends on gather phase. + */ + double gather_time; + + /** + * \brief The time spends on apply phase. + */ + double apply_time; + + /** + * \brief The time spends on scatter phase. + */ + double scatter_time; + /** * \brief The timeout time in seconds */ @@ -503,11 +534,21 @@ namespace graphlab { */ dense_bitset active_minorstep; + /** + * \brief A counter measuring the number of gathers that have been completed + */ + atomic completed_gathers; + /** * \brief A counter measuring the number of applys that have been completed */ atomic completed_applys; + /** + * \brief A counter measuring the number of scatters that have been completed + */ + atomic completed_scatters; + /** * \brief The shared counter used coordinate operations between @@ -1073,7 +1114,9 @@ namespace graphlab { // Clear up force_abort = false; iteration_counter = 0; + completed_gathers = 0; completed_applys = 0; + completed_scatters = 0; has_message.clear(); has_gather_accum.clear(); has_cache.clear(); @@ -1279,6 +1322,9 @@ namespace graphlab { // Start the timer graphlab::timer timer; timer.start(); start_time = timer::approx_time_seconds(); + exec_time = exch_time = recv_time = + gather_time = apply_time = scatter_time = 0.0; + graphlab::timer ti, bk_ti; iteration_counter = 0; force_abort = false; execution_status::status_enum termination_reason = @@ -1300,6 +1346,7 @@ namespace graphlab { << std::endl; } // Program Main loop 
==================================================== + ti.start(); while(iteration_counter < max_iterations && !force_abort ) { // Check first to see if we are out of time @@ -1311,7 +1358,7 @@ namespace graphlab { bool print_this_round = (elapsed_seconds() - last_print) >= 5; if(rmi.procid() == 0 && print_this_round) { - logstream(LOG_EMPH) + logstream(LOG_DEBUG) << rmi.procid() << ": Starting iteration: " << iteration_counter << std::endl; last_print = elapsed_seconds(); @@ -1326,7 +1373,9 @@ namespace graphlab { // Exchange Messages -------------------------------------------------- // Exchange any messages in the local message vectors // if (rmi.procid() == 0) std::cout << "Exchange messages..." << std::endl; + bk_ti.start(); run_synchronous( &synchronous_engine::exchange_messages ); + exch_time += bk_ti.current_time(); /** * Post conditions: * 1) only master vertices have messages @@ -1339,11 +1388,13 @@ namespace graphlab { // if (rmi.procid() == 0) std::cout << "Receive messages..." << std::endl; num_active_vertices = 0; + bk_ti.start(); run_synchronous( &synchronous_engine::receive_messages ); if (sched_allv) { active_minorstep.fill(); } has_message.clear(); + recv_time += bk_ti.current_time(); /** * Post conditions: * 1) there are no messages remaining @@ -1372,11 +1423,13 @@ namespace graphlab { // Execute the gather operation for all vertices that are active // in this minor-step (active-minorstep bit set). // if (rmi.procid() == 0) std::cout << "Gathering..." 
<< std::endl; + bk_ti.start(); run_synchronous( &synchronous_engine::execute_gathers ); // Clear the minor step bit since only super-step vertices // (only master vertices are required to participate in the // apply step) active_minorstep.clear(); // rmi.barrier(); + gather_time += bk_ti.current_time(); /** * Post conditions: * 1) gather_accum for all master vertices contains the @@ -1388,7 +1441,9 @@ namespace graphlab { // Execute Apply Operations ------------------------------------------- // Run the apply function on all active vertices // if (rmi.procid() == 0) std::cout << "Applying..." << std::endl; + bk_ti.start(); run_synchronous( &synchronous_engine::execute_applys ); + apply_time += bk_ti.current_time(); /** * Post conditions: * 1) any changes to the vertex data have been synchronized @@ -1403,13 +1458,15 @@ namespace graphlab { // Execute Scatter Operations ----------------------------------------- // Execute each of the scatters on all minor-step active vertices. + bk_ti.start(); run_synchronous( &synchronous_engine::execute_scatters ); + scatter_time += bk_ti.current_time(); /** * Post conditions: * 1) NONE */ if(rmi.procid() == 0 && print_this_round) - logstream(LOG_EMPH) << "\t Running Aggregators" << std::endl; + logstream(LOG_DEBUG) << "\t Running Aggregators" << std::endl; // probe the aggregator aggregator.tick_synchronous(); @@ -1419,6 +1476,7 @@ namespace graphlab { graph.save_binary(snapshot_path); } } + exec_time = ti.current_time(); if (rmi.procid() == 0) { logstream(LOG_EMPH) << iteration_counter @@ -1433,16 +1491,48 @@ namespace graphlab { all_compute_time_vec[rmi.procid()] = total_compute_time; rmi.all_gather(all_compute_time_vec); + /*logstream(LOG_INFO) << "Local Calls(G|A|S): " + << completed_gathers.value << "|" + << completed_applys.value << "|" + << completed_scatters.value + << std::endl;*/ + size_t global_completed = completed_applys; rmi.all_reduce(global_completed); completed_applys = global_completed; rmi.cout() << "Updates: " 
<< completed_applys.value << "\n"; + +#ifdef TUNING + global_completed = completed_gathers; + rmi.all_reduce(global_completed); + completed_gathers = global_completed; + + global_completed = completed_scatters; + rmi.all_reduce(global_completed); + completed_scatters = global_completed; +#endif + if (rmi.procid() == 0) { logstream(LOG_INFO) << "Compute Balance: "; for (size_t i = 0;i < all_compute_time_vec.size(); ++i) { logstream(LOG_INFO) << all_compute_time_vec[i] << " "; } +#ifdef TUNING + logstream(LOG_INFO) << "Total Calls(G|A|S): " + << completed_gathers.value << "|" + << completed_applys.value << "|" + << completed_scatters.value + << std::endl; logstream(LOG_INFO) << std::endl; + logstream(LOG_EMPH) << " Execution Time: " << exec_time << std::endl; + logstream(LOG_EMPH) << "Breakdown(X|R|G|A|S): " + << exch_time << "|" + << recv_time << "|" + << gather_time << "|" + << apply_time << "|" + << scatter_time + << std::endl; +#endif } rmi.full_barrier(); // Stop the aggregator @@ -1566,6 +1656,7 @@ namespace graphlab { const size_t TRY_RECV_MOD = 1000; size_t vcount = 0; const bool caching_enabled = !gather_cache.empty(); + size_t ngather_inc = 0; timer ti; fixed_dense_bitset<8 * sizeof(size_t)> local_bitset; // a word-size = 64 bit @@ -1616,7 +1707,7 @@ namespace graphlab { // elocks[local_edge.id()].unlock(); } } // end of if in_edges/all_edges - // Loop over out edges + // Loop over out edges if(gather_dir == OUT_EDGES || gather_dir == ALL_EDGES) { foreach(local_edge_type local_edge, local_vertex.out_edges()) { edge_type edge(local_edge); @@ -1630,8 +1721,9 @@ namespace graphlab { // elocks[local_edge.id()].unlock(); ++edges_touched; } - INCREMENT_EVENT(EVENT_GATHERS, edges_touched); } // end of if out_edges/all_edges + INCREMENT_EVENT(EVENT_GATHERS, edges_touched); + ++ngather_inc; vprog.post_local_gather(accum); // If caching is enabled then save the accumulator to the // cache for future iterations. 
Note that it is possible @@ -1652,10 +1744,11 @@ namespace graphlab { // try to recv gathers if there are any in the buffer if(++vcount % TRY_RECV_MOD == 0) recv_gathers(); } - } // end of loop over vertices to compute gather accumulators + } // end of loop over vertices to compute gather accumulators + completed_gathers += ngather_inc; per_thread_compute_time[thread_id] += ti.current_time(); gather_exchange.partial_flush(); - // Finish sending and receiving all gather operations + // Finish sending and receiving all gather operations thread_barrier.wait(); if(thread_id == 0) gather_exchange.flush(); thread_barrier.wait(); @@ -1669,6 +1762,7 @@ namespace graphlab { context_type context(*this, graph); const size_t TRY_RECV_MOD = 1000; size_t vcount = 0; + size_t napply_inc = 0; timer ti; fixed_dense_bitset<8 * sizeof(size_t)> local_bitset; // allocate a word size = 64bits @@ -1696,7 +1790,7 @@ namespace graphlab { INCREMENT_EVENT(EVENT_APPLIES, 1); vertex_programs[lvid].apply(context, vertex, accum); // record an apply as a completed task - ++completed_applys; + ++napply_inc; // Clear the accumulator to save some memory gather_accum[lvid] = gather_type(); // synchronize the changed vertex data with all mirrors @@ -1718,7 +1812,7 @@ namespace graphlab { } } } // end of loop over vertices to run apply - + completed_applys += napply_inc; per_thread_compute_time[thread_id] += ti.current_time(); vprog_exchange.partial_flush(); vdata_exchange.partial_flush(); @@ -1739,6 +1833,7 @@ namespace graphlab { void synchronous_engine:: execute_scatters(const size_t thread_id) { context_type context(*this, graph); + size_t nscatter_inc = 0; timer ti; fixed_dense_bitset<8 * sizeof(size_t)> local_bitset; // allocate a word size = 64 bits while (1) { @@ -1760,7 +1855,7 @@ namespace graphlab { local_vertex_type local_vertex = graph.l_vertex(lvid); const vertex_type vertex(local_vertex); const edge_dir_type scatter_dir = vprog.scatter_edges(context, vertex); - size_t edges_touched = 0; 
+ size_t edges_touched = 0; // Loop over in edges if(scatter_dir == IN_EDGES || scatter_dir == ALL_EDGES) { foreach(local_edge_type local_edge, local_vertex.in_edges()) { @@ -1768,8 +1863,8 @@ namespace graphlab { // elocks[local_edge.id()].lock(); vprog.scatter(context, vertex, edge); // elocks[local_edge.id()].unlock(); + ++edges_touched; } - ++edges_touched; } // end of if in_edges/all_edges // Loop over out edges if(scatter_dir == OUT_EDGES || scatter_dir == ALL_EDGES) { @@ -1778,15 +1873,16 @@ namespace graphlab { // elocks[local_edge.id()].lock(); vprog.scatter(context, vertex, edge); // elocks[local_edge.id()].unlock(); + ++edges_touched; } - ++edges_touched; } // end of if out_edges/all_edges - INCREMENT_EVENT(EVENT_SCATTERS, edges_touched); + INCREMENT_EVENT(EVENT_SCATTERS, edges_touched); // Clear the vertex program vertex_programs[lvid] = vertex_program_type(); + ++nscatter_inc; } // end of if active on this minor step } // end of loop over vertices to complete scatter operation - + completed_scatters += nscatter_inc; per_thread_compute_time[thread_id] += ti.current_time(); } // end of execute_scatters diff --git a/src/graphlab/graph/builtin_parsers.hpp b/src/graphlab/graph/builtin_parsers.hpp index 64d34eef2e..d395183c49 100644 --- a/src/graphlab/graph/builtin_parsers.hpp +++ b/src/graphlab/graph/builtin_parsers.hpp @@ -1,3 +1,29 @@ +/* + * Copyright (c) 2013 Shanghai Jiao Tong University. + * All rights reserved. + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an "AS + * IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either + * express or implied. 
See the License for the specific language + * governing permissions and limitations under the License. + * + * For more about this software visit: + * + * http://ipads.se.sjtu.edu.cn/projects/powerlyra.html + * + * + * 2013.11 implement rtsv_parser for debugging + * + */ + + /** * Copyright (c) 2009 Carnegie Mellon University. * All rights reserved. @@ -95,6 +121,24 @@ namespace graphlab { return true; } // end of tsv parser + /** + * \brief Parse files in the reverse tsv format (for debugging) + * + * This is identical to the tsv format but reverse edge direction. + * + */ + template + bool rtsv_parser(Graph& graph, const std::string& srcfilename, + const std::string& str) { + if (str.empty()) return true; + size_t source, target; + char* targetptr; + source = strtoul(str.c_str(), &targetptr, 10); + if (targetptr == NULL) return false; + target = strtoul(targetptr, NULL, 10); + if(source != target) graph.add_edge(target, source); + return true; + } // end of rtsv parser template bool csv_parser(Graph& graph, @@ -189,8 +233,6 @@ namespace graphlab { } }; - - template struct graphjrl_writer{ diff --git a/src/graphlab/graph/distributed_graph.hpp b/src/graphlab/graph/distributed_graph.hpp index e8fea5a0ee..eadcfc4368 100644 --- a/src/graphlab/graph/distributed_graph.hpp +++ b/src/graphlab/graph/distributed_graph.hpp @@ -1,3 +1,29 @@ +/* + * Copyright (c) 2013 Shanghai Jiao Tong University. + * All rights reserved. + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an "AS + * IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either + * express or implied. 
See the License for the specific language + * governing permissions and limitations under the License. + * + * For more about this software visit: + * + * http://ipads.se.sjtu.edu.cn/projects/powerlyra.html + * + * + * 2014.02 add call to bipartite-aware partitioning + * 2013.11 add call to hybrid partitioning for power-law graphs + * + */ + /** * Copyright (c) 2009 Carnegie Mellon University. * All rights reserved. @@ -82,6 +108,15 @@ #include #include +// bipartite +#include +#include +#include + +// hybrid +#include +#include + #include #include @@ -401,11 +436,21 @@ namespace graphlab { friend class distributed_identity_ingress; friend class distributed_oblivious_ingress; friend class distributed_constrained_random_ingress; + friend class distributed_bipartite_random_ingress; + friend class distributed_bipartite_affinity_ingress; + friend class distributed_bipartite_aweto_ingress; + friend class distributed_hybrid_ingress; + friend class distributed_hybrid_ginger_ingress; typedef graphlab::vertex_id_type vertex_id_type; typedef graphlab::lvid_type lvid_type; typedef graphlab::edge_id_type edge_id_type; + enum degree_type {HIGH = 0, LOW, NUM_DEGREE_TYPES}; + + enum cuts_type {VERTEX_CUTS = 0, EDGE_CUTS, HYBRID_CUTS, HYBRID_GINGER_CUTS, + NUM_CUTS_TYPES}; + struct vertex_type; typedef bool edge_list_type; class edge_type; @@ -613,13 +658,13 @@ namespace graphlab { const graphlab_options& opts = graphlab_options()) : rpc(dc, this), finalized(false), vid2lvid(), nverts(0), nedges(0), local_own_nverts(0), nreplicas(0), - ingress_ptr(NULL), + how_cuts(VERTEX_CUTS), ingress_ptr(NULL), #ifdef _OPENMP vertex_exchange(dc, omp_get_max_threads()), #else vertex_exchange(dc), #endif - vset_exchange(dc), parallel_ingress(true) { + vset_exchange(dc), parallel_ingress(true), data_affinity(false) { rpc.barrier(); set_options(opts); } @@ -634,10 +679,23 @@ namespace graphlab { } private: void set_options(const graphlab_options& opts) { + std::string ingress_method = ""; + 
+ // hybrid cut + size_t threshold = 100; + // ginger heuristic + size_t interval = std::numeric_limits::max(); + size_t nedges = 0; + size_t nverts = 0; + // bipartite + std::string favorite = "source"; /* source or target */ + + + // deprecated size_t bufsize = 50000; bool usehash = false; bool userecent = false; - std::string ingress_method = ""; + std::vector keys = opts.get_graph_args().get_option_keys(); foreach(std::string opt, keys) { if (opt == "ingress") { @@ -650,30 +708,63 @@ namespace graphlab { if (!parallel_ingress && rpc.procid() == 0) logstream(LOG_EMPH) << "Disable parallel ingress. Graph will be streamed through one node." << std::endl; + } else if (opt == "threshold") { + opts.get_graph_args().get_option("threshold", threshold); + if (rpc.procid() == 0) + logstream(LOG_EMPH) << "Graph Option: threshold = " + << threshold << std::endl; + } else if (opt == "interval") { + opts.get_graph_args().get_option("interval", interval); + if (rpc.procid() == 0) + logstream(LOG_EMPH) << "Graph Option: interval = " + << interval << std::endl; + } else if (opt == "nedges") { + opts.get_graph_args().get_option("nedges", nedges); + if (rpc.procid() == 0) + logstream(LOG_EMPH) << "Graph Option: nedges = " + << nedges << std::endl; + } else if (opt == "nverts") { + opts.get_graph_args().get_option("nverts", nverts); + if (rpc.procid() == 0) + logstream(LOG_EMPH) << "Graph Option: nverts = " + << nverts << std::endl; + } else if (opt == "affinity") { + opts.get_graph_args().get_option("affinity", data_affinity); + if (rpc.procid() == 0) + logstream(LOG_EMPH) << "Graph Option: affinity = " + << data_affinity << std::endl; + } else if (opt == "favorite") { + opts.get_graph_args().get_option("favorite", favorite); + if(favorite != "target") favorite = "source"; + if (rpc.procid() == 0) + logstream(LOG_EMPH) << "Graph Option: favorite = " + << favorite << std::endl; } + /** * These options below are deprecated. 
*/ else if (opt == "bufsize") { opts.get_graph_args().get_option("bufsize", bufsize); - if (rpc.procid() == 0) + if (rpc.procid() == 0) logstream(LOG_EMPH) << "Graph Option: bufsize = " << bufsize << std::endl; - } else if (opt == "usehash") { + } else if (opt == "usehash") { opts.get_graph_args().get_option("usehash", usehash); if (rpc.procid() == 0) logstream(LOG_EMPH) << "Graph Option: usehash = " << usehash << std::endl; } else if (opt == "userecent") { opts.get_graph_args().get_option("userecent", userecent); - if (rpc.procid() == 0) + if (rpc.procid() == 0) logstream(LOG_EMPH) << "Graph Option: userecent = " << userecent << std::endl; - } else { + } else { logstream(LOG_ERROR) << "Unexpected Graph Option: " << opt << std::endl; } - } - set_ingress_method(ingress_method, bufsize, usehash, userecent); + } + set_ingress_method(ingress_method, bufsize, usehash, userecent, favorite, + threshold, nedges, nverts, interval); } public: @@ -892,7 +983,6 @@ namespace graphlab { return true; } - /** * \brief Performs a map-reduce operation on each vertex in the * graph returning the result. @@ -2186,7 +2276,8 @@ namespace graphlab { #endif for(size_t i = 0; i < graph_files.size(); ++i) { if ((parallel_ingress && (i % rpc.numprocs() == rpc.procid())) - || (!parallel_ingress && (rpc.procid() == 0))) { + || (!parallel_ingress && (rpc.procid() == 0)) + || (data_affinity)) { logstream(LOG_EMPH) << "Loading graph from file: " << graph_files[i] << std::endl; // is it a gzip file ? 
const bool gzip = boost::ends_with(graph_files[i], ".gz"); @@ -2234,6 +2325,7 @@ namespace graphlab { if (graph_files.size() == 0) { logstream(LOG_WARNING) << "No files found matching " << prefix << std::endl; } + #ifdef _OPENMP #pragma omp parallel for #endif @@ -2420,6 +2512,9 @@ namespace graphlab { } else if (format == "tsv") { line_parser = builtin_parsers::tsv_parser; load(path, line_parser); + } else if (format == "rtsv") { // debug + line_parser = builtin_parsers::rtsv_parser; + load(path, line_parser); } else if (format == "csv") { line_parser = builtin_parsers::csv_parser; load(path, line_parser); @@ -2597,6 +2692,8 @@ namespace graphlab { struct vertex_record { /// The official owning processor for this vertex procid_t owner; + /// The degree type of vertex + degree_type dtype; /// The local vid of this vertex on this proc vertex_id_type gvid; /// The number of in edges @@ -2605,9 +2702,9 @@ namespace graphlab { NOT be in this set.*/ mirror_type _mirrors; vertex_record() : - owner(-1), gvid(-1), num_in_edges(0), num_out_edges(0) { } + owner(-1), dtype(HIGH), gvid(-1), num_in_edges(0), num_out_edges(0) { } vertex_record(const vertex_id_type& vid) : - owner(-1), gvid(vid), num_in_edges(0), num_out_edges(0) { } + owner(-1), dtype(HIGH), gvid(vid), num_in_edges(0), num_out_edges(0) { } procid_t get_owner () const { return owner; } const mirror_type& mirrors() const { return _mirrors; } size_t num_mirrors() const { return _mirrors.popcount(); } @@ -2619,6 +2716,7 @@ namespace graphlab { void load(iarchive& arc) { clear(); arc >> owner + >> dtype >> gvid >> num_in_edges >> num_out_edges @@ -2627,6 +2725,7 @@ namespace graphlab { void save(oarchive& arc) const { arc << owner + << dtype << gvid << num_in_edges << num_out_edges @@ -2636,6 +2735,7 @@ namespace graphlab { bool operator==(const vertex_record& other) const { return ( (owner == other.owner) && + (dtype == other.dtype) && (gvid == other.gvid) && (num_in_edges == other.num_in_edges) && (num_out_edges == 
other.num_out_edges) && @@ -2812,6 +2912,13 @@ namespace graphlab { return lvid2record[lvid].owner; } + /** \internal + * \brief Returns the type of vertex. + */ + degree_type l_degree_type(lvid_type lvid) const { + ASSERT_LT(lvid, lvid2record.size()); + return lvid2record[lvid].dtype; + } /** \internal * \brief Returns a reference to the internal graph representation @@ -3121,6 +3228,10 @@ namespace graphlab { public: + cuts_type get_cuts_type() const { return how_cuts; } + + void set_cuts_type(cuts_type type) { how_cuts = type; } + // For the warp engine to find the remote instances of this class size_t get_rpc_obj_id() { return rpc.get_obj_id(); @@ -3152,6 +3263,9 @@ namespace graphlab { /** The global number of vertex replica */ size_t nreplicas; + /** The cut type */ + cuts_type how_cuts; + /** pointer to the distributed ingress object*/ distributed_ingress_base* ingress_ptr; @@ -3164,11 +3278,16 @@ namespace graphlab { /** Command option to disable parallel ingress. Used for simulating single node ingress */ bool parallel_ingress; + /** Command option to enable data affinity. 
Currently only supported by bipartite */ + bool data_affinity; lock_manager_type lock_manager; void set_ingress_method(const std::string& method, - size_t bufsize = 50000, bool usehash = false, bool userecent = false) { + size_t bufsize = 50000, bool usehash = false, bool userecent = false, + std::string favorite = "source", + size_t threshold = 100, size_t nedges = 0, size_t nverts = 0, + size_t interval = std::numeric_limits::max()) { if(ingress_ptr != NULL) { delete ingress_ptr; ingress_ptr = NULL; } if (method == "oblivious") { if (rpc.procid() == 0) logstream(LOG_EMPH) << "Use oblivious ingress, usehash: " << usehash @@ -3183,9 +3302,29 @@ namespace graphlab { } else if (method == "pds") { if (rpc.procid() == 0)logstream(LOG_EMPH) << "Use pds ingress" << std::endl; ingress_ptr = new distributed_constrained_random_ingress(rpc.dc(), *this, "pds"); + } else if (method == "bipartite") { + if(data_affinity){ + if (rpc.procid() == 0) logstream(LOG_EMPH) << "Use bipartite ingress w/ affinity" << std::endl; + ingress_ptr = new distributed_bipartite_affinity_ingress(rpc.dc(), *this, favorite); + } else{ + if (rpc.procid() == 0) logstream(LOG_EMPH) << "Use bipartite ingress w/o affinity" << std::endl; + ingress_ptr = new distributed_bipartite_random_ingress(rpc.dc(), *this, favorite); + } + } else if (method == "bipartite_aweto") { + if (rpc.procid() == 0) logstream(LOG_EMPH) << "Use bipartite_aweto ingress" << std::endl; + ingress_ptr = new distributed_bipartite_aweto_ingress(rpc.dc(), *this, favorite); + } else if (method == "hybrid") { + if (rpc.procid() == 0) logstream(LOG_EMPH) << "Use hybrid ingress" << std::endl; + ingress_ptr = new distributed_hybrid_ingress(rpc.dc(), *this, threshold); + set_cuts_type(HYBRID_CUTS); + } else if (method == "hybrid_ginger") { + if (rpc.procid() == 0) logstream(LOG_EMPH) << "Use hybrid ginger ingress" << std::endl; + ASSERT_GT(nedges, 0); ASSERT_GT(nverts, 0); + ingress_ptr = new distributed_hybrid_ginger_ingress(rpc.dc(), *this, 
threshold, nedges, nverts, interval); + set_cuts_type(HYBRID_GINGER_CUTS); } else { // use default ingress method if none is specified - std::string ingress_auto=""; + std::string ingress_auto = ""; size_t num_shards = rpc.numprocs(); int nrow, ncol, p; if (sharding_constraint::is_pds_compatible(num_shards, p)) { @@ -3198,7 +3337,7 @@ namespace graphlab { ingress_auto="oblivious"; ingress_ptr = new distributed_oblivious_ingress(rpc.dc(), *this, usehash, userecent); } - if (rpc.procid() == 0)logstream(LOG_EMPH) << "Automatically determine ingress method: " << ingress_auto << std::endl; + if (rpc.procid() == 0) logstream(LOG_EMPH) << "Automatically determine ingress method: " << ingress_auto << std::endl; } // batch ingress is deprecated // if (method == "batch") { diff --git a/src/graphlab/graph/ingress/distributed_bipartite_affinity_ingress.hpp b/src/graphlab/graph/ingress/distributed_bipartite_affinity_ingress.hpp new file mode 100644 index 0000000000..c17ddc7f55 --- /dev/null +++ b/src/graphlab/graph/ingress/distributed_bipartite_affinity_ingress.hpp @@ -0,0 +1,237 @@ +/* + * Copyright (c) 2013 Shanghai Jiao Tong University. + * All rights reserved. + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an "AS + * IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either + * express or implied. See the License for the specific language + * governing permissions and limitations under the License. 
+ * + * For more about this software visit: + * + * http://ipads.se.sjtu.edu.cn/projects/powerlyra.html + * + * + * 2014.04 implement bipartite-aware partitioning with affinity + * + */ + + +#ifndef GRAPHLAB_DISTRIBUTED_BIPARTITE_AFFINITY_INGRESS_HPP +#define GRAPHLAB_DISTRIBUTED_BIPARTITE_AFFINITY_INGRESS_HPP + +#include + +#include +#include +#include +#include + +#include +#include +#include +#include +#include + +#define TUNING +namespace graphlab { + template + class distributed_graph; + + /** + * \brief Ingress object assigning edges with data affinity for bipartite graph. + */ + template + class distributed_bipartite_affinity_ingress : + public distributed_ingress_base { + public: + typedef distributed_graph graph_type; + /// The type of the vertex data stored in the graph + typedef VertexData vertex_data_type; + /// The type of the edge data stored in the graph + typedef EdgeData edge_data_type; + + + typedef distributed_ingress_base base_type; + + typedef typename graph_type::vertex_record vertex_record; + + typedef typename base_type::edge_buffer_record edge_buffer_record; + typedef typename base_type::vertex_buffer_record vertex_buffer_record; + + /// The rpc interface for this object + dc_dist_object bipartite_rpc; + /// The underlying distributed graph object that is being loaded + graph_type& graph; + + simple_spinlock bipartite_vertex_lock; + std::vector bipartite_vertexs; + simple_spinlock bipartite_edge_lock; + std::vector bipartite_edges; + + bool favorite_source; + + typedef typename boost::unordered_map + master_hash_table_type; + typedef typename std::pair + master_pair_type; + typedef typename buffered_exchange::buffer_type + master_buffer_type; + + master_hash_table_type mht; + buffered_exchange mht_exchange; + + public: + distributed_bipartite_affinity_ingress(distributed_control& dc, graph_type& graph, const std::string& favorite): + base_type(dc, graph), bipartite_rpc(dc, this), graph(graph), mht_exchange(dc) { + favorite_source = favorite 
== "source" ? true : false; + } // end of constructor + + ~distributed_bipartite_affinity_ingress() { } + + /** Add an edge to the ingress object using random assignment. */ + void add_edge(vertex_id_type source, vertex_id_type target, + const EdgeData& edata) { + const edge_buffer_record record(source, target, edata); + bipartite_edge_lock.lock(); + bipartite_edges.push_back(record); + bipartite_edge_lock.unlock(); + } // end of add edge + + void add_vertex(vertex_id_type vid, const VertexData& vdata) { + const vertex_buffer_record record(vid, vdata); + bipartite_vertex_lock.lock(); + bipartite_vertexs.push_back(record); + mht[vid] = bipartite_rpc.procid(); + bipartite_vertex_lock.unlock(); + } // end of add vertex + + void finalize() { + graphlab::timer ti; + + size_t nprocs = bipartite_rpc.numprocs(); + procid_t l_procid = bipartite_rpc.procid(); + + bipartite_rpc.full_barrier(); + + if (l_procid == 0) { + memory_info::log_usage("start finalizing"); + logstream(LOG_EMPH) << "bipartite w/ affinity finalizing ..." + << " #verts=" << graph.local_graph.num_vertices() + << " #edges=" << graph.local_graph.num_edges() + << " favorite=" << (favorite_source ? "source" : "target") + << std::endl; + } + + + /** + * Fast pass for redundant finalization with no graph changes. + */ + { + size_t changed_size = bipartite_vertexs.size() + bipartite_edges.size(); + bipartite_rpc.all_reduce(changed_size); + if (changed_size == 0) { + logstream(LOG_INFO) << "Skipping Graph Finalization because no changes happened..." 
<< std::endl; + return; + } + } + + /* directly add vertices loaded from local file to the local graph */ + const size_t local_nverts = bipartite_vertexs.size(); + graph.lvid2record.resize(local_nverts); + graph.local_graph.resize(local_nverts); + lvid_type lvid = 0; + foreach(const vertex_buffer_record& rec, bipartite_vertexs){ + graph.vid2lvid[rec.vid] = lvid; + graph.local_graph.add_vertex(lvid,rec.vdata); + vertex_record& vrec = graph.lvid2record[lvid]; + vrec.gvid = rec.vid; + vrec.owner = bipartite_rpc.procid(); + lvid++; + } + +#ifdef TUNING + if(l_procid == 0) { + logstream(LOG_INFO) << "add " << local_nverts << " vertex: " + << ti.current_time() + << " secs" + << std::endl; + } +#endif + + + /* exchange mapping table using mht_exchange */ + for (typename master_hash_table_type::iterator it = mht.begin(); + it != mht.end(); ++it) { + for (procid_t i = 0; i < nprocs; ++i) { + if (i != l_procid) + mht_exchange.send(i, master_pair_type(it->first, it->second)); + } + } + + mht_exchange.flush(); + master_buffer_type master_buffer; + procid_t proc = -1; + while(mht_exchange.recv(proc, master_buffer)) { + foreach(const master_pair_type& pair, master_buffer) { + mht[pair.first] = pair.second; + } + } + mht_exchange.clear(); + +#ifdef TUNING + if(l_procid == 0) { + logstream(LOG_INFO) << "exchange mapping: " + << ti.current_time() + << " secs" + << std::endl; + } +#endif + + + /* distribute edges */ + foreach(const edge_buffer_record& rec, bipartite_edges){ + vertex_id_type favorite = favorite_source ? 
rec.source : rec.target; + if(mht.find(favorite) == mht.end()) + mht[favorite] = graph_hash::hash_vertex(favorite) % nprocs; + const procid_t owning_proc = mht[favorite]; + // save to the buffer of edge_exchange in ingress_base + base_type::edge_exchange.send(owning_proc, rec); + } + bipartite_vertexs.clear(); + bipartite_edges.clear(); + mht.clear(); + +#ifdef TUNING + if(l_procid == 0) { + logstream(LOG_INFO) << "distribute edges: " + << ti.current_time() + << " secs" + << std::endl; + } +#endif + + // call base finalize() + base_type::finalize(); + if(l_procid == 0) { + memory_info::log_usage("bipartite w/ affinity finalizing graph done."); + logstream(LOG_EMPH) << "bipartite w/ affinity finalizing graph. (" + << ti.current_time() + << " secs)" + << std::endl; + } + } // end of finalize + + }; // end of distributed_bipartite_affinity_ingress +}; // end of namespace graphlab +#include + + +#endif diff --git a/src/graphlab/graph/ingress/distributed_bipartite_aweto_ingress.hpp b/src/graphlab/graph/ingress/distributed_bipartite_aweto_ingress.hpp new file mode 100644 index 0000000000..17a02e2975 --- /dev/null +++ b/src/graphlab/graph/ingress/distributed_bipartite_aweto_ingress.hpp @@ -0,0 +1,350 @@ +/* + * Copyright (c) 2013 Shanghai Jiao Tong University. + * All rights reserved. + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an "AS + * IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either + * express or implied. See the License for the specific language + * governing permissions and limitations under the License. 
+ * + * For more about this software visit: + * + * http://ipads.se.sjtu.edu.cn/projects/powerlyra.html + * + * + * 2014.04 implement bipartite-aware partitioning with heuristic (aweto) + * + */ + + +#ifndef GRAPHLAB_DISTRIBUTED_BIPARTITE_AWETO_INGRESS_HPP +#define GRAPHLAB_DISTRIBUTED_BIPARTITE_AWETO_INGRESS_HPP + +#include + +#include +#include +#include +#include +#include +#include + + +#include +#include +#include +#include + +#define TUNING +namespace graphlab { + template + class distributed_graph; + + /** + * \brief Ingress object benefit for bipartite graph. + */ + template + class distributed_bipartite_aweto_ingress : + public distributed_ingress_base { + public: + typedef distributed_graph graph_type; + /// The type of the vertex data stored in the graph + typedef VertexData vertex_data_type; + /// The type of the edge data stored in the graph + typedef EdgeData edge_data_type; + + + typedef distributed_ingress_base base_type; + + typedef typename graph_type::vertex_record vertex_record; + + typedef typename base_type::edge_buffer_record edge_buffer_record; + typedef typename buffered_exchange::buffer_type + edge_buffer_type; + + typedef typename base_type::vertex_buffer_record vertex_buffer_record; + typedef typename buffered_exchange::buffer_type + vertex_buffer_type; + + + /// The rpc interface for this object + dc_dist_object bipartite_rpc; + /// The underlying distributed graph object that is being loaded + graph_type& graph; + + std::vector bipartite_edges; + + bool favorite_source; + + /* ingress exchange */ + buffered_exchange bipartite_vertex_exchange; + buffered_exchange bipartite_edge_exchange; + + public: + distributed_bipartite_aweto_ingress(distributed_control& dc, graph_type& graph, const std::string& favorite) : + base_type(dc, graph), bipartite_rpc(dc, this), graph(graph), +#ifdef _OPENMP + bipartite_vertex_exchange(dc, omp_get_max_threads()), + bipartite_edge_exchange(dc, omp_get_max_threads()) +#else + 
bipartite_vertex_exchange(dc),bipartite_edge_exchange(dc) +#endif + { + favorite_source = (favorite == "source") ? true : false; + } // end of constructor + + ~distributed_bipartite_aweto_ingress() { + + } + + /** accumulate edges temporal rally point using random of "favorite" assignment. */ + void add_edge(vertex_id_type source, vertex_id_type target, + const EdgeData& edata) { + vertex_id_type favorite = favorite_source ? source : target; + const procid_t owning_proc = + graph_hash::hash_vertex(favorite) % bipartite_rpc.numprocs(); + const edge_buffer_record record(source, target, edata); +#ifdef _OPENMP + bipartite_edge_exchange.send(owning_proc, record, omp_get_thread_num()); +#else + bipartite_edge_exchange.send(owning_proc, record); +#endif + } // end of add edge + + /** accumulate edges temporal rally point using random of "favorite" assignment. */ + void add_vertex(vertex_id_type vid, const VertexData& vdata) { + const procid_t owning_proc = + graph_hash::hash_vertex(vid) % bipartite_rpc.numprocs(); + const vertex_buffer_record record(vid, vdata); +#ifdef _OPENMP + bipartite_vertex_exchange.send(owning_proc, record, omp_get_thread_num()); +#else + bipartite_vertex_exchange.send(owning_proc, record); +#endif + } // end of add vertex + + void finalize() { + graphlab::timer ti; + + size_t nprocs = bipartite_rpc.numprocs(); + procid_t l_procid = bipartite_rpc.procid(); + + + bipartite_rpc.full_barrier(); + + if (l_procid == 0) { + memory_info::log_usage("start finalizing"); + logstream(LOG_EMPH) << "bipartite aweto finalizing ..." + << " #verts=" << graph.local_graph.num_vertices() + << " #edges=" << graph.local_graph.num_edges() + << " favorite=" << (favorite_source ? 
"source" : "target") + << std::endl; + } + + /**************************************************************************/ + /* */ + /* Flush any additional data */ + /* */ + /**************************************************************************/ + bipartite_edge_exchange.flush(); bipartite_vertex_exchange.flush(); + + /** + * Fast pass for redundant finalization with no graph changes. + */ + { + size_t changed_size = bipartite_edge_exchange.size() + bipartite_vertex_exchange.size(); + bipartite_rpc.all_reduce(changed_size); + if (changed_size == 0) { + logstream(LOG_INFO) << "Skipping Graph Finalization because no changes happened..." << std::endl; + return; + } + } + + /**************************************************************************/ + /* */ + /* calculate the distribution of favorite vertex's neighbors */ + /* */ + /**************************************************************************/ + boost::unordered_map > count_map; + edge_buffer_type edge_buffer; + procid_t proc(-1); + while(bipartite_edge_exchange.recv(proc, edge_buffer)) { + foreach(const edge_buffer_record& rec, edge_buffer) { + vertex_id_type favorite, second; + if (favorite_source) { favorite = rec.source; second = rec.target; } + else { favorite = rec.target; second = rec.source; } + + if(count_map.find(favorite) == count_map.end()) + count_map[favorite].resize(nprocs); + + const procid_t owner_proc = graph_hash::hash_vertex(second) % nprocs; + count_map[favorite][owner_proc] += 1; + + bipartite_edges.push_back(rec); + } + } + bipartite_edge_exchange.clear(); + + + /**************************************************************************/ + /* */ + /* record heuristic location of favorite vertex */ + /* */ + /**************************************************************************/ + buffered_exchange vid_buffer(bipartite_rpc.dc()); + std::set own_vid_set; + + // record current nedges distributed from this machine. 
+ std::vector proc_num_edges(nprocs); + boost::unordered_map mht; + + for(typename boost::unordered_map >::iterator it = count_map.begin(); + it != count_map.end(); ++it) { + procid_t best_proc = l_procid; + // heuristic score + double best_score = (it->second)[best_proc] + - sqrt(1.0*proc_num_edges[best_proc]); + + for(size_t i = 0; i < bipartite_rpc.numprocs(); i++) { + double score = (it->second)[i] + - sqrt(1.0*proc_num_edges[i]); + if(score > best_score) { + best_proc = i; + best_score = score; + } + } + + // update nedges + for(size_t i = 0; i < bipartite_rpc.numprocs(); i++) + proc_num_edges[best_proc] += (it->second)[i]; + + mht[it->first] = best_proc; + vid_buffer.send(best_proc, it->first); + } + + // find all favorite vertices this machine own + vid_buffer.flush(); + { + typename buffered_exchange::buffer_type buffer; + procid_t recvid(-1); + while(vid_buffer.recv(recvid, buffer)) { + foreach(const vertex_id_type vid, buffer) + own_vid_set.insert(vid); + } + } + vid_buffer.clear(); + +#ifdef TUNING + if(l_procid == 0) { + logstream(LOG_INFO) << "hold " << own_vid_set.size() << " masters: " + << ti.current_time() + << " secs" + << std::endl; + } +#endif + + + /**************************************************************************/ + /* */ + /* exchange edges */ + /* */ + /**************************************************************************/ + for (size_t i = 0; i < bipartite_edges.size(); i++) { + edge_buffer_record& rec = bipartite_edges[i]; + vertex_id_type favorite = favorite_source ? 
rec.source : rec.target; + procid_t owner_proc = mht[favorite]; + // save to the buffer of edge_exchange in ingress_base + base_type::edge_exchange.send(owner_proc,rec); + } + +#ifdef TUNING + if(l_procid == 0) { + logstream(LOG_INFO) << "exchange edges: " + << ti.current_time() + << " secs" + << std::endl; + } +#endif + + + /**************************************************************************/ + /* */ + /* add vertices to local graph */ + /* */ + /**************************************************************************/ + graph.lvid2record.resize(own_vid_set.size()); + graph.local_graph.resize(own_vid_set.size()); + lvid_type lvid = 0; + foreach(const vertex_id_type& vid, own_vid_set){ + graph.vid2lvid[vid] = lvid; + vertex_record& vrec = graph.lvid2record[lvid]; + vrec.gvid = vid; + vrec.owner = bipartite_rpc.procid(); + lvid++; + } + own_vid_set.clear(); + +#ifdef TUNING + if(l_procid == 0) { + logstream(LOG_INFO) << "add vertices to local graph: " + << ti.current_time() + << " secs" + << std::endl; + } +#endif + + + /**************************************************************************/ + /* */ + /* re-send favorite vertex data */ + /* */ + /**************************************************************************/ + { + vertex_buffer_type vertex_buffer; procid_t sending_proc(-1); + while(bipartite_vertex_exchange.recv(sending_proc, vertex_buffer)) { + foreach(const vertex_buffer_record& rec, vertex_buffer) { + if(mht.find(rec.vid) != mht.end()) { + base_type::vertex_exchange.send(mht[rec.vid], rec); + } else { + base_type::vertex_exchange.send(l_procid, rec); + } + } + } + bipartite_vertex_exchange.clear(); + mht.clear(); + } + +#ifdef TUNING + if(l_procid == 0) { + logstream(LOG_INFO) << "exchange vertex data: " + << ti.current_time() + << " secs" + << std::endl; + } +#endif + + + // call base finalize() + base_type::finalize(); + if(l_procid == 0) { + memory_info::log_usage("bipartite aweto finalizing graph done."); + logstream(LOG_EMPH) << 
"bipartite aweto finalizing graph. (" + << ti.current_time() + << " secs)" + << std::endl; + } + } // end of finalize + + }; // end of distributed_bipartite_aweto_ingress +}; // end of namespace graphlab +#include + + +#endif diff --git a/src/graphlab/graph/ingress/distributed_bipartite_random_ingress.hpp b/src/graphlab/graph/ingress/distributed_bipartite_random_ingress.hpp new file mode 100644 index 0000000000..e42e2d0c4c --- /dev/null +++ b/src/graphlab/graph/ingress/distributed_bipartite_random_ingress.hpp @@ -0,0 +1,82 @@ +/* + * Copyright (c) 2013 Shanghai Jiao Tong University. + * All rights reserved. + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an "AS + * IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either + * express or implied. See the License for the specific language + * governing permissions and limitations under the License. + * + * For more about this software visit: + * + * http://ipads.se.sjtu.edu.cn/projects/powerlyra.html + * + * + * 2014.04 implement bipartite-aware random partitioning + * + */ + + +#ifndef GRAPHLAB_DISTRIBUTED_BIPARTITE_RANDOM_INGRESS_HPP +#define GRAPHLAB_DISTRIBUTED_BIPARTITE_RANDOM_INGRESS_HPP + +#include + +#include +#include +#include +#include + + +#include +namespace graphlab { + template + class distributed_graph; + + /** + * \brief Ingress object assigning edges using randoming hash function on favorite. 
+ */ + template + class distributed_bipartite_random_ingress : + public distributed_ingress_base { + public: + typedef distributed_graph graph_type; + /// The type of the vertex data stored in the graph + typedef VertexData vertex_data_type; + /// The type of the edge data stored in the graph + typedef EdgeData edge_data_type; + + + typedef distributed_ingress_base base_type; + + bool favorite_source; + public: + distributed_bipartite_random_ingress(distributed_control& dc, graph_type& graph, const std::string& favorite) : + base_type(dc, graph) { + favorite_source = (favorite == "source") ? true : false; + } // end of constructor + + ~distributed_bipartite_random_ingress() { } + + /** Add an edge to the ingress object using random of "favorite" assignment. */ + void add_edge(vertex_id_type source, vertex_id_type target, + const EdgeData& edata) { + typedef typename base_type::edge_buffer_record edge_buffer_record; + vertex_id_type favorite = favorite_source ? source : target; + const procid_t owning_proc = graph_hash::hash_vertex(favorite) % base_type::rpc.numprocs(); + const edge_buffer_record record(source, target, edata); + base_type::edge_exchange.send(owning_proc, record); + } // end of add edge + }; // end of distributed_bipartite_random_ingress +}; // end of namespace graphlab +#include + + +#endif diff --git a/src/graphlab/graph/ingress/distributed_hybrid_ginger_ingress.hpp b/src/graphlab/graph/ingress/distributed_hybrid_ginger_ingress.hpp new file mode 100644 index 0000000000..9af3fc5c2c --- /dev/null +++ b/src/graphlab/graph/ingress/distributed_hybrid_ginger_ingress.hpp @@ -0,0 +1,1031 @@ +/* + * Copyright (c) 2013 Shanghai Jiao Tong University. + * All rights reserved. + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. 
+ * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an "AS + * IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either + * express or implied. See the License for the specific language + * governing permissions and limitations under the License. + * + * For more about this software visit: + * + * http://ipads.se.sjtu.edu.cn/projects/powerlyra.html + * + * + * 2013.11 implement hybrid partitioning with heuristic (ginger) + * + */ + + +#ifndef GRAPHLAB_DISTRIBUTED_HYBRID_GINGER_INGRESS_HPP +#define GRAPHLAB_DISTRIBUTED_HYBRID_GINGER_INGRESS_HPP + +#include + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include + +#define TUNING +namespace graphlab { + template + class distributed_graph; + + /** + * \brief Ingress object assigning edges using a hybrid method. + * That is, for high degree edges, + * for low degree edges, hashing from its target vertex. 
+ */ + template + class distributed_hybrid_ginger_ingress : + public distributed_ingress_base { + + public: + typedef distributed_graph graph_type; + /// The type of the vertex data stored in the graph + typedef VertexData vertex_data_type; + /// The type of the edge data stored in the graph + typedef EdgeData edge_data_type; + + typedef distributed_ingress_base base_type; + + typedef typename graph_type::vertex_record vertex_record; + typedef typename graph_type::mirror_type mirror_type; + + + typedef typename buffered_exchange::buffer_type + vertex_id_buffer_type; + + typedef typename base_type::edge_buffer_record edge_buffer_record; + typedef typename buffered_exchange::buffer_type + edge_buffer_type; + + typedef typename base_type::vertex_buffer_record vertex_buffer_record; + typedef typename buffered_exchange::buffer_type + vertex_buffer_type; + + typedef typename boost::unordered_map > raw_map_type; + + /// detail vertex record for the second pass coordination. + typedef typename base_type::vertex_negotiator_record + vertex_negotiator_record; + + /// ginger structure + /** Type of the master location hash table: [vertex-id, location-of-master] */ + typedef typename boost::unordered_map + master_hash_table_type; + typedef typename std::pair + master_pair_type; + typedef typename buffered_exchange::buffer_type + master_buffer_type; + + typedef typename std::pair + proc_score_pair_type; + typedef typename buffered_exchange::buffer_type + proc_score_buffer_type; + + /// The rpc interface for this object + dc_dist_object hybrid_rpc; + /// The underlying distributed graph object that is being loaded + graph_type& graph; + + /// threshold to divide high-degree and low-degree vertices + size_t threshold; + + bool standalone; + + std::vector hybrid_edges; + + /* ingress exchange */ + buffered_exchange hybrid_edge_exchange; + buffered_exchange hybrid_vertex_exchange; + + buffered_exchange high_edge_exchange; + buffered_exchange low_edge_exchange; + buffered_exchange 
resend_vertex_exchange; + + /** master hash table (mht): location mapping of low-degree vertices */ + master_hash_table_type mht; + buffered_exchange mht_exchange; + + // consider both #edge and #vertex + std::vector proc_balance; + std::vector proc_score_incr; + buffered_exchange< proc_score_pair_type > proc_score_exchange; + + /// heuristic model from fennel + /// records about the number of edges and vertices in the graph + /// given from the commandline + size_t tot_nedges; + size_t tot_nverts; + /// threshold for incremental mht to be synced across the cluster + /// when the incremental mht size reaches the preset interval, + /// we will perform a synchronization on mht across the cluster + size_t interval; + /// arguments for the ginger algorithm + double alpha; + double gamma; + + + public: + distributed_hybrid_ginger_ingress(distributed_control& dc, graph_type& graph, + size_t threshold = 100, size_t tot_nedges = 0, size_t tot_nverts = 0, + size_t interval = std::numeric_limits::max()) : + base_type(dc, graph), hybrid_rpc(dc, this), + graph(graph), threshold(threshold), +#ifdef _OPENMP + hybrid_edge_exchange(dc, omp_get_max_threads()), + hybrid_vertex_exchange(dc, omp_get_max_threads()), +#else + hybrid_edge_exchange(dc), + hybrid_vertex_exchange(dc), +#endif + high_edge_exchange(dc), low_edge_exchange(dc), resend_vertex_exchange(dc), + mht_exchange(dc), proc_balance(dc.numprocs()), + proc_score_incr(dc.numprocs()), proc_score_exchange(dc), + tot_nedges(tot_nedges), tot_nverts(tot_nverts), interval(interval) { + ASSERT_GT(tot_nedges, 0); ASSERT_GT(tot_nverts, 0); + + gamma = 1.5; + alpha = sqrt(dc.numprocs()) * double(tot_nedges) / pow(tot_nverts, gamma); + + /* fast pass for standalone case. */ + standalone = hybrid_rpc.numprocs() == 1; + hybrid_rpc.barrier(); + } // end of constructor + + ~distributed_hybrid_ginger_ingress() { } + + /** Add an edge to the ingress object using random hashing assignment. 
+ * This function acts as the first phase for SNAP graph to deliver edges + * via the hashing value of its target vertex. + */ + void add_edge(vertex_id_type source, vertex_id_type target, + const EdgeData& edata) { + const edge_buffer_record record(source, target, edata); + const procid_t owning_proc = standalone ? 0 : + graph_hash::hash_vertex(target) % hybrid_rpc.numprocs(); +#ifdef _OPENMP + hybrid_edge_exchange.send(owning_proc, record, omp_get_thread_num()); +#else + hybrid_edge_exchange.send(owning_proc, record); +#endif + } // end of add edge + + + /* add vdata */ + void add_vertex(vertex_id_type vid, const VertexData& vdata) { + const vertex_buffer_record record(vid, vdata); + const procid_t owning_proc = standalone ? 0 : + graph_hash::hash_vertex(vid) % hybrid_rpc.numprocs(); +#ifdef _OPENMP + hybrid_vertex_exchange.send(owning_proc, record, omp_get_thread_num()); +#else + hybrid_vertex_exchange.send(owning_proc, record); +#endif + } // end of add vertex + + + /* ginger heuristic for low-degree vertex */ + procid_t ginger_to_proc (const vertex_id_type target, + const std::vector& records) { + size_t nprocs = hybrid_rpc.numprocs(); + std::vector proc_score(nprocs); + std::vector proc_degrees(nprocs); + + for (size_t i = 0; i < records.size(); ++i) { + if (mht.find(records[i].source) != mht.end()) + proc_degrees[mht[records[i].source]]++; + } + + for (size_t i = 0; i < nprocs; ++i) { + proc_score[i] = proc_degrees[i] + - alpha * gamma * pow(proc_balance[i], (gamma - 1)); + } + + double best_score = proc_score[0]; + procid_t best_proc = 0; + for (size_t i = 1; i < nprocs; ++i) { + if (proc_score[i] > best_score) { + best_score = proc_score[i]; + best_proc = i; + } + } + + return best_proc; + }; + + /* ginger heuristic for low-degree vertex */ + void sync_heuristic() { + size_t nprocs = hybrid_rpc.numprocs(); + procid_t l_procid = hybrid_rpc.procid(); + + // send proc_score_incr + for (procid_t p = 0; p < nprocs; p++) { + for (procid_t i = 0; i < nprocs; i++) 
+ if (i != l_procid) + proc_score_exchange.send(i, std::make_pair(p, proc_score_incr[p])); + proc_score_incr[p] = 0; + } + + // flush proc_score_incr and mht but w/o spin + proc_score_exchange.partial_flush(0); + mht_exchange.partial_flush(0); + + + // update local mht and proc_balance + master_buffer_type master_buffer; + procid_t proc = -1; + while(mht_exchange.recv(proc, master_buffer, false)) { + foreach(const master_pair_type& pair, master_buffer) + mht[pair.first] = pair.second; + } + mht_exchange.clear(); + + proc_score_buffer_type proc_edge_buffer; + proc = -1; + while (proc_score_exchange.recv(proc, proc_edge_buffer, false)) { + foreach (const proc_score_pair_type& pair, proc_edge_buffer) + proc_balance[pair.first] += pair.second; + } + proc_score_exchange.clear(); + } + + void assign_hybrid_edges() { + graphlab::timer ti; + size_t nprocs = hybrid_rpc.numprocs(); + procid_t l_procid = hybrid_rpc.procid(); + raw_map_type raw_map; + size_t vcount = 0; + + // collect edges + edge_buffer_type edge_buffer; + procid_t proc = -1; + while (hybrid_edge_exchange.recv(proc, edge_buffer)) { + foreach(const edge_buffer_record& rec, edge_buffer) { + raw_map[rec.target].push_back(rec); + } + } + hybrid_edge_exchange.clear(); + +#ifdef TUNING + if(l_procid == 0) { + logstream(LOG_INFO) << "collect raw map: " + << ti.current_time() + << " secs" + << std::endl; + } + logstream(LOG_INFO) << "receive " << raw_map.size() + << " vertices done." 
<< std::endl; +#endif + + + //assign vertices and its in-edges to hosting node + for (typename raw_map_type::iterator it = raw_map.begin(); + it != raw_map.end(); ++it) { + vertex_id_type target = it->first; + procid_t owning_proc = 0; + size_t degree = it->second.size(); + + if (degree > threshold) { + // TODO: no need send, just resend latter + owning_proc = graph_hash::hash_vertex(target) % nprocs; + for (size_t i = 0; i < degree; ++i) + high_edge_exchange.send(owning_proc, it->second[i]); + } else { + owning_proc = ginger_to_proc(target, it->second); + for (size_t i = 0; i < degree; ++i) + low_edge_exchange.send(owning_proc, it->second[i]); + + // update mht and nedges_incr + for (procid_t p = 0; p < nprocs; ++p) { + if (p != l_procid) + mht_exchange.send(p, master_pair_type(target, owning_proc)); + else + mht[target] = owning_proc; + } + + // adjust balance according to vertex and edge + proc_balance[owning_proc]++; + proc_balance[owning_proc] += + (degree * float(tot_nverts) / float(tot_nedges)); + + proc_score_incr[owning_proc]++; + proc_score_incr[owning_proc] += + (degree * float(tot_nverts) / float(tot_nedges)); + } + + // periodical synchronize heurisitic + if ((++vcount % interval) == 0) sync_heuristic(); + } + + // last synchronize on mht + mht_exchange.flush(); + master_buffer_type master_buffer; + proc = -1; + while(mht_exchange.recv(proc, master_buffer)) { + foreach(const master_pair_type& pair, master_buffer) + mht[pair.first] = pair.second; + } + mht_exchange.clear(); + + +#ifdef TUNING + //logstream(LOG_INFO) << "balance["; + //for (procid_t i = 0; i < nprocs; i++) + // logstream(LOG_INFO) << proc_balance[i] << ","; + //logstream(LOG_INFO) << "] "; + logstream(LOG_INFO) << "nsyncs(" << (vcount / interval) + << ") using " << ti.current_time() << " secs " + << "#mht=" << mht.size() + << std::endl; +#endif + } + + void finalize() { + graphlab::timer ti; + + size_t nprocs = hybrid_rpc.numprocs(); + procid_t l_procid = hybrid_rpc.procid(); + size_t 
nedges = 0; + + hybrid_rpc.full_barrier(); + + if (l_procid == 0) { + memory_info::log_usage("start finalizing"); + logstream(LOG_EMPH) << "ginger finalizing ..." + << " #vertices=" << graph.local_graph.num_vertices() + << " #edges=" << graph.local_graph.num_edges() + << " threshold=" << threshold + << " interval=" << interval + << " gamma=" << gamma + << " alpha=" << alpha + << std::endl; + } + + /**************************************************************************/ + /* */ + /* Flush any additional data */ + /* */ + /**************************************************************************/ + hybrid_edge_exchange.flush(); hybrid_vertex_exchange.flush(); + + /** + * Fast pass for redundant finalization with no graph changes. + */ + { + size_t changed_size = hybrid_edge_exchange.size() + hybrid_vertex_exchange.size(); + hybrid_rpc.all_reduce(changed_size); + if (changed_size == 0) { + logstream(LOG_INFO) << "Skipping Graph Finalization because no changes happened..." << std::endl; + return; + } + } + + + /**************************************************************************/ + /* */ + /* Assign edges */ + /* */ + /**************************************************************************/ + if (!standalone) assign_hybrid_edges(); + +#ifdef TUNING + if(l_procid == 0) { + logstream(LOG_INFO) << "assign edges: " + << ti.current_time() + << " secs" + << std::endl; + } +#endif + + /**************************************************************************/ + /* */ + /* Prepare hybrid ingress */ + /* */ + /**************************************************************************/ + if (standalone) { /* fast pass for standalone */ + edge_buffer_type edge_buffer; + procid_t proc = -1; + nedges = hybrid_edge_exchange.size(); + + while(hybrid_edge_exchange.recv(proc, edge_buffer)) { + foreach(const edge_buffer_record& rec, edge_buffer) + hybrid_edges.push_back(rec); + } + hybrid_edge_exchange.clear(); + } else { + high_edge_exchange.flush(); 
low_edge_exchange.flush(); + + nedges = low_edge_exchange.size(); + hybrid_edges.reserve(nedges + high_edge_exchange.size()); + + edge_buffer_type edge_buffer; + procid_t proc = -1; + while(low_edge_exchange.recv(proc, edge_buffer)) { + foreach(const edge_buffer_record& rec, edge_buffer) { + if (mht.find(rec.source) == mht.end()) + mht[rec.source] = graph_hash::hash_vertex(rec.source) % nprocs; + + hybrid_edges.push_back(rec); + } + } + low_edge_exchange.clear(); + +#ifdef TUNING + if(l_procid == 0) { + logstream(LOG_INFO) << "low-degree edges: " + << ti.current_time() + << " secs" + << std::endl; + } +#endif + + + // re-send edges of high-degree vertices by hybrid_edge_exchange + proc = -1; + while(high_edge_exchange.recv(proc, edge_buffer)) { + foreach(const edge_buffer_record& rec, edge_buffer) { + if (mht.find(rec.source) == mht.end()) + mht[rec.source] = graph_hash::hash_vertex(rec.source) % nprocs; + + const procid_t owner_proc = mht[rec.source]; + if (owner_proc == l_procid) { + hybrid_edges.push_back(rec); + ++nedges; + } else { + hybrid_edge_exchange.send(owner_proc, rec); + } + } + } + high_edge_exchange.clear(); + + // receive edges of high-degree vertices + hybrid_edge_exchange.flush(); +#ifdef TUNING + logstream(LOG_INFO) << "receive #edges=" << hybrid_edge_exchange.size() + << std::endl; +#endif + proc = -1; + while(hybrid_edge_exchange.recv(proc, edge_buffer)) { + foreach(const edge_buffer_record& rec, edge_buffer) { + mht[rec.source] = l_procid; + hybrid_edges.push_back(rec); + ++nedges; + } + } + hybrid_edge_exchange.clear(); + } + + if(l_procid == 0) { + memory_info::log_usage("prepare ginger finalizing done."); + logstream(LOG_EMPH) << "prepare ginger finalizing. (" + << ti.current_time() + << " secs)" + << std::endl; + } + + // connect to base finalize() + modified_base_finalize(nedges); + + // set vertex degree type for hybrid engine + set_degree_type(); + + if(l_procid == 0) { + logstream(LOG_EMPH) << "ginger finalizing graph. 
(" + << ti.current_time() + << " secs)" + << std::endl; + } + } // end of finalize + + void set_degree_type() { + graphlab::timer ti; + procid_t l_procid = hybrid_rpc.procid(); + size_t high_master = 0, high_mirror = 0, low_master = 0, low_mirror = 0; + + for (size_t lvid = 0; lvid < graph.num_local_vertices(); lvid++) { + vertex_record& vrec = graph.lvid2record[lvid]; + if (vrec.num_in_edges > threshold) { + vrec.dtype = graph_type::HIGH; + if (vrec.owner == l_procid) high_master ++; + else high_mirror ++; + } else { + vrec.dtype = graph_type::LOW; + if (vrec.owner == l_procid) low_master ++; + else low_mirror ++; + } + } + +#ifdef TUNING + // Compute the total number of high-degree and low-degree vertices + std::vector swap_counts(hybrid_rpc.numprocs()); + + swap_counts[l_procid] = high_master; + hybrid_rpc.all_gather(swap_counts); + high_master = 0; + foreach(size_t count, swap_counts) high_master += count; + + swap_counts[l_procid] = high_mirror; + hybrid_rpc.all_gather(swap_counts); + high_mirror = 0; + foreach(size_t count, swap_counts) high_mirror += count; + + swap_counts[l_procid] = low_master; + hybrid_rpc.all_gather(swap_counts); + low_master = 0; + foreach(size_t count, swap_counts) low_master += count; + + swap_counts[l_procid] = low_mirror; + hybrid_rpc.all_gather(swap_counts); + low_mirror = 0; + foreach(size_t count, swap_counts) low_mirror += count; + + if(l_procid == 0) { + logstream(LOG_EMPH) << "hybrid info: master [" + << high_master << " " + << low_master << " " + << (float(high_master)/(high_master+low_master)) << "]" + << std::endl; + if ((high_mirror + low_mirror) > 0) + logstream(LOG_EMPH) << "hybrid info: mirror [" + << high_mirror << " " + << low_mirror << " " + << (float(high_mirror)/(high_mirror+low_mirror)) << "]" + << std::endl; + + memory_info::log_usage("set vertex type done."); + logstream(LOG_EMPH) << "set vertex type: " + << ti.current_time() + << " secs" + << std::endl; + } +#endif + } + + + /* do the same job as original base 
finalize except for + * extracting edges from hybrid_edges instead of original edge_buffer; + * and using mht to tracing the master location of each vertex. + */ + void modified_base_finalize(size_t nedges) { + graphlab::timer ti; + procid_t l_procid = hybrid_rpc.procid(); + size_t nprocs = hybrid_rpc.numprocs(); + + hybrid_rpc.full_barrier(); + + bool first_time_finalize = false; + /** + * Fast pass for first time finalization. + */ + if (graph.is_dynamic()) { + size_t nverts = graph.num_local_vertices(); + hybrid_rpc.all_reduce(nverts); + first_time_finalize = (nverts == 0); + } else { + first_time_finalize = false; + } + + + typedef typename hopscotch_map::value_type + vid2lvid_pair_type; + + /** + * \internal + * Buffer storage for new vertices to the local graph. + */ + typedef typename graph_type::hopscotch_map_type vid2lvid_map_type; + vid2lvid_map_type vid2lvid_buffer; + + /** + * \internal + * The begining id assinged to the first new vertex. + */ + const lvid_type lvid_start = graph.vid2lvid.size(); + + /** + * \internal + * Bit field incidate the vertex that is updated during the ingress. 
+ */ + dense_bitset updated_lvids(graph.vid2lvid.size()); + + + /**************************************************************************/ + /* */ + /* Construct local graph */ + /* */ + /**************************************************************************/ + { // Add all the edges to the local graph + graph.local_graph.reserve_edge_space(nedges + 1); + + foreach(const edge_buffer_record& rec, hybrid_edges) { + // Get the source_vlid; + lvid_type source_lvid(-1); + if(graph.vid2lvid.find(rec.source) == graph.vid2lvid.end()) { + if (vid2lvid_buffer.find(rec.source) == vid2lvid_buffer.end()) { + source_lvid = lvid_start + vid2lvid_buffer.size(); + vid2lvid_buffer[rec.source] = source_lvid; + } else { + source_lvid = vid2lvid_buffer[rec.source]; + } + } else { + source_lvid = graph.vid2lvid[rec.source]; + updated_lvids.set_bit(source_lvid); + } + // Get the target_lvid; + lvid_type target_lvid(-1); + if(graph.vid2lvid.find(rec.target) == graph.vid2lvid.end()) { + if (vid2lvid_buffer.find(rec.target) == vid2lvid_buffer.end()) { + target_lvid = lvid_start + vid2lvid_buffer.size(); + vid2lvid_buffer[rec.target] = target_lvid; + } else { + target_lvid = vid2lvid_buffer[rec.target]; + } + } else { + target_lvid = graph.vid2lvid[rec.target]; + updated_lvids.set_bit(target_lvid); + } + graph.local_graph.add_edge(source_lvid, target_lvid, rec.edata); + } // end for loop over buffers + hybrid_edges.clear(); + + ASSERT_EQ(graph.vid2lvid.size() + vid2lvid_buffer.size(), + graph.local_graph.num_vertices()); +#ifdef TUNING + if(l_procid == 0) { + logstream(LOG_INFO) << "populating local graph: " + << ti.current_time() + << " secs" + << std::endl; + } +#endif + // Finalize local graph + graph.local_graph.finalize(); +#ifdef TUNING + logstream(LOG_INFO) << "local graph info: " << std::endl + << "\t nverts: " << graph.local_graph.num_vertices() + << std::endl + << "\t nedges: " << graph.local_graph.num_edges() + << std::endl; + + if(l_procid == 0) { + logstream(LOG_INFO) << 
"finalizing local graph: " + << ti.current_time() + << " secs" + << std::endl; + } +#endif + } + + + /**************************************************************************/ + /* */ + /* Receive and add vertex data to masters */ + /* */ + /**************************************************************************/ + // Setup the map containing all the vertices being negotiated by this machine + { + if (standalone) { + vertex_buffer_type vertex_buffer; + procid_t proc = -1; + while(hybrid_vertex_exchange.recv(proc, vertex_buffer)) { + foreach(const vertex_buffer_record& rec, vertex_buffer) { + lvid_type lvid(-1); + if (graph.vid2lvid.find(rec.vid) == graph.vid2lvid.end()) { + if (vid2lvid_buffer.find(rec.vid) == vid2lvid_buffer.end()) { + lvid = lvid_start + vid2lvid_buffer.size(); + vid2lvid_buffer[rec.vid] = lvid; + } else { + lvid = vid2lvid_buffer[rec.vid]; + } + } else { + lvid = graph.vid2lvid[rec.vid]; + updated_lvids.set_bit(lvid); + } + if (distributed_hybrid_ginger_ingress::vertex_combine_strategy + && lvid < graph.num_local_vertices()) { + distributed_hybrid_ginger_ingress::vertex_combine_strategy( + graph.l_vertex(lvid).data(), rec.vdata); + } else { + graph.local_graph.add_vertex(lvid, rec.vdata); + } + } + } + hybrid_vertex_exchange.clear(); + } + else { + // re-send by trampoline + vertex_buffer_type vertex_buffer; + procid_t proc = -1; + while (hybrid_vertex_exchange.recv(proc, vertex_buffer)) { + foreach (const vertex_buffer_record& rec, vertex_buffer) { + if (mht.find(rec.vid) == mht.end()) + mht[rec.vid] = graph_hash::hash_vertex(rec.vid) % nprocs; + resend_vertex_exchange.send(mht[rec.vid], rec); + } + } + hybrid_vertex_exchange.clear(); + + // receive vertex data re-sent by other machines + resend_vertex_exchange.flush(); + proc = -1; + while(resend_vertex_exchange.recv(proc, vertex_buffer)) { + foreach(const vertex_buffer_record& rec, vertex_buffer) { + lvid_type lvid(-1); + if (graph.vid2lvid.find(rec.vid) == graph.vid2lvid.end()) { + if 
(vid2lvid_buffer.find(rec.vid) == vid2lvid_buffer.end()) { + lvid = lvid_start + vid2lvid_buffer.size(); + vid2lvid_buffer[rec.vid] = lvid; + } else { + lvid = vid2lvid_buffer[rec.vid]; + } + } else { + lvid = graph.vid2lvid[rec.vid]; + updated_lvids.set_bit(lvid); + } + if (distributed_hybrid_ginger_ingress::vertex_combine_strategy + && lvid < graph.num_local_vertices()) { + distributed_hybrid_ginger_ingress::vertex_combine_strategy( + graph.l_vertex(lvid).data(), rec.vdata); + } else { + graph.local_graph.add_vertex(lvid, rec.vdata); + } + } + } + resend_vertex_exchange.clear(); + } + +#ifdef TUNING + if(l_procid == 0) { + logstream(LOG_INFO) << "adding vertex data: " + << ti.current_time() + << " secs" + << std::endl; + } +#endif + } // end of loop to populate vrecmap + + + + /**************************************************************************/ + /* */ + /* Assign vertex data and allocate vertex (meta)data space */ + /* */ + /**************************************************************************/ + { + // determine masters for all negotiated vertices + const size_t local_nverts = graph.vid2lvid.size() + vid2lvid_buffer.size(); + graph.lvid2record.reserve(local_nverts); + graph.lvid2record.resize(local_nverts); + graph.local_graph.resize(local_nverts); + foreach(const vid2lvid_pair_type& pair, vid2lvid_buffer) { + vertex_record& vrec = graph.lvid2record[pair.second]; + vrec.gvid = pair.first; + if (standalone) { + vrec.owner = 0; + } else { + if (mht.find(pair.first) == mht.end()) + mht[pair.first] = graph_hash::hash_vertex(pair.first) % nprocs; + vrec.owner = mht[pair.first]; + } + } + ASSERT_EQ(local_nverts, graph.local_graph.num_vertices()); + ASSERT_EQ(graph.lvid2record.size(), graph.local_graph.num_vertices()); +#ifdef TUNING + if(l_procid == 0) { + logstream(LOG_INFO) << "allocating lvid2record: " + << ti.current_time() + << " secs" + << std::endl; + } +#endif + mht.clear(); + } + + 
/**************************************************************************/ + /* */ + /* Master handshake */ + /* */ + /**************************************************************************/ + if (!standalone) { +#ifdef _OPENMP + buffered_exchange vid_buffer(hybrid_rpc.dc(), omp_get_max_threads()); +#else + buffered_exchange vid_buffer(hybrid_rpc.dc()); +#endif + +#ifdef _OPENMP +#pragma omp parallel for +#endif + // send not owned vids to their master + for (lvid_type i = lvid_start; i < graph.lvid2record.size(); ++i) { + procid_t master = graph.lvid2record[i].owner; + if (master != l_procid) +#ifdef _OPENMP + vid_buffer.send(master, graph.lvid2record[i].gvid, omp_get_thread_num()); +#else + vid_buffer.send(master, graph.lvid2record[i].gvid); +#endif + } + vid_buffer.flush(); + hybrid_rpc.barrier(); + + // receive all vids owned by me + mutex flying_vids_lock; + boost::unordered_map flying_vids; +#ifdef _OPENMP +#pragma omp parallel +#endif + { + typename buffered_exchange::buffer_type buffer; + procid_t recvid = -1; + while(vid_buffer.recv(recvid, buffer)) { + foreach(const vertex_id_type vid, buffer) { + if (graph.vid2lvid.find(vid) == graph.vid2lvid.end()) { + if (vid2lvid_buffer.find(vid) == vid2lvid_buffer.end()) { + flying_vids_lock.lock(); + mirror_type& mirrors = flying_vids[vid]; + mirrors.set_bit(recvid); + flying_vids_lock.unlock(); + } else { + lvid_type lvid = vid2lvid_buffer[vid]; + graph.lvid2record[lvid]._mirrors.set_bit(recvid); + } + } else { + lvid_type lvid = graph.vid2lvid[vid]; + graph.lvid2record[lvid]._mirrors.set_bit(recvid); + updated_lvids.set_bit(lvid); + } + } + } + } + vid_buffer.clear(); + + if (!flying_vids.empty()) { + logstream(LOG_INFO) << "#flying-own-nverts=" + << flying_vids.size() + << std::endl; + + // reallocate spaces for the flying vertices. 
+ size_t vsize_old = graph.lvid2record.size(); + size_t vsize_new = vsize_old + flying_vids.size(); + graph.lvid2record.resize(vsize_new); + graph.local_graph.resize(vsize_new); + for (typename boost::unordered_map::iterator it = flying_vids.begin(); + it != flying_vids.end(); ++it) { + lvid_type lvid = lvid_start + vid2lvid_buffer.size(); + vertex_record& vrec = graph.lvid2record[lvid]; + vertex_id_type gvid = it->first; + vrec.owner = l_procid; + vrec.gvid = gvid; + vrec._mirrors = it->second; + vid2lvid_buffer[gvid] = lvid; + } + } + } // end of master handshake + +#ifdef TUNING + if(l_procid == 0) { + logstream(LOG_INFO) << "master handshake: " + << ti.current_time() + << " secs" + << std::endl; + } +#endif + + + /**************************************************************************/ + /* */ + /* Merge in vid2lvid_buffer */ + /* */ + /**************************************************************************/ + { + if (graph.vid2lvid.size() == 0) { + graph.vid2lvid.swap(vid2lvid_buffer); + } else { + graph.vid2lvid.rehash(graph.vid2lvid.size() + vid2lvid_buffer.size()); + foreach (const typename vid2lvid_map_type::value_type& pair, vid2lvid_buffer) { + graph.vid2lvid.insert(pair); + } + vid2lvid_buffer.clear(); + } + } + + + /**************************************************************************/ + /* */ + /* Synchronize vertex data and meta information */ + /* */ + /**************************************************************************/ + // TODO: optimization for standalone + { + // construct the vertex set of changed vertices + + // Fast pass for first time finalize; + vertex_set changed_vset(true); + + // Compute the vertices that needs synchronization + if (!first_time_finalize) { + vertex_set changed_vset = vertex_set(false); + changed_vset.make_explicit(graph); + updated_lvids.resize(graph.num_local_vertices()); + for (lvid_type i = lvid_start; i < graph.num_local_vertices(); ++i) { + updated_lvids.set_bit(i); + } + changed_vset.localvset = 
updated_lvids; + buffered_exchange vset_exchange(hybrid_rpc.dc()); + // sync vset with all mirrors + changed_vset.synchronize_mirrors_to_master_or(graph, vset_exchange); + changed_vset.synchronize_master_to_mirrors(graph, vset_exchange); + } + + graphlab::graph_gather_apply + vrecord_sync_gas(graph, + boost::bind(&distributed_hybrid_ginger_ingress::finalize_gather, this, _1, _2), + boost::bind(&distributed_hybrid_ginger_ingress::finalize_apply, this, _1, _2, _3)); + vrecord_sync_gas.exec(changed_vset); + +#ifdef TUNING + if(l_procid == 0) { + logstream(LOG_INFO) << "synchronizing vertex (meta)data: " + << ti.current_time() + << " secs" + << std::endl; + } +#endif + } + + base_type::exchange_global_info(standalone); +#ifdef TUNING + if(l_procid == 0) { + logstream(LOG_INFO) << "exchange global info: " + << ti.current_time() + << " secs" + << std::endl; + } +#endif + + if(l_procid == 0) { + memory_info::log_usage("base finalizing done."); + logstream(LOG_EMPH) << "base finalizing. (" + << ti.current_time() + << " secs)" + << std::endl; + } + } // end of modified base finalize + + private: + boost::function vertex_combine_strategy; + + /** + * \brief Gather the vertex distributed meta data. + */ + vertex_negotiator_record finalize_gather(lvid_type& lvid, graph_type& graph) { + vertex_negotiator_record accum; + accum.num_in_edges = graph.local_graph.num_in_edges(lvid); + accum.num_out_edges = graph.local_graph.num_out_edges(lvid); + if (graph.l_is_master(lvid)) { + accum.has_data = true; + accum.vdata = graph.l_vertex(lvid).data(); + accum.mirrors = graph.lvid2record[lvid]._mirrors; + } + return accum; + } + + /** + * \brief Update the vertex data structures with the gathered vertex metadata. 
+ */ + void finalize_apply(lvid_type lvid, const vertex_negotiator_record& accum, graph_type& graph) { + typename graph_type::vertex_record& vrec = graph.lvid2record[lvid]; + vrec.num_in_edges = accum.num_in_edges; + vrec.num_out_edges = accum.num_out_edges; + graph.l_vertex(lvid).data() = accum.vdata; + vrec._mirrors = accum.mirrors; + } + }; // end of distributed_hybrid_ginger_ingress +}; // end of namespace graphlab +#include + + +#endif diff --git a/src/graphlab/graph/ingress/distributed_hybrid_ingress.hpp b/src/graphlab/graph/ingress/distributed_hybrid_ingress.hpp new file mode 100644 index 0000000000..520c90fd9e --- /dev/null +++ b/src/graphlab/graph/ingress/distributed_hybrid_ingress.hpp @@ -0,0 +1,771 @@ +/* + * Copyright (c) 2013 Shanghai Jiao Tong University. + * All rights reserved. + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an "AS + * IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either + * express or implied. See the License for the specific language + * governing permissions and limitations under the License. + * + * For more about this software visit: + * + * http://ipads.se.sjtu.edu.cn/projects/powerlyra.html + * + * + * 2013.11 implement hybrid random partitioning + * + */ + + +#ifndef GRAPHLAB_DISTRIBUTED_HYBRID_INGRESS_HPP +#define GRAPHLAB_DISTRIBUTED_HYBRID_INGRESS_HPP + +#include + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include + +#define TUNING +namespace graphlab { + template + class distributed_graph; + + /** + * \brief Ingress object assigning edges using a hybrid method. 
+ * That is, for high degree edge, hashing from its source vertex; + * for low degree edge, hashing from its target vertex. + */ + template + class distributed_hybrid_ingress : + public distributed_ingress_base { + public: + typedef distributed_graph graph_type; + /// The type of the vertex data stored in the graph + typedef VertexData vertex_data_type; + /// The type of the edge data stored in the graph + typedef EdgeData edge_data_type; + + typedef distributed_ingress_base base_type; + + typedef typename graph_type::vertex_record vertex_record; + typedef typename graph_type::mirror_type mirror_type; + + typedef typename buffered_exchange::buffer_type + vertex_id_buffer_type; + + /// The rpc interface for this object + dc_dist_object hybrid_rpc; + /// The underlying distributed graph object that is being loaded + graph_type& graph; + + /// threshold to divide high-degree and low-degree vertices + size_t threshold; + + bool standalone; + + typedef typename base_type::edge_buffer_record edge_buffer_record; + typedef typename buffered_exchange::buffer_type + edge_buffer_type; + + typedef typename base_type::vertex_buffer_record vertex_buffer_record; + typedef typename buffered_exchange::buffer_type + vertex_buffer_type; + + std::vector hybrid_edges; + + /* ingress exchange */ + buffered_exchange hybrid_edge_exchange; + buffered_exchange hybrid_vertex_exchange; + + /// detail vertex record for the second pass coordination. + typedef typename base_type::vertex_negotiator_record + vertex_negotiator_record; + + public: + distributed_hybrid_ingress(distributed_control& dc, + graph_type& graph, size_t threshold = 100) : + base_type(dc, graph), hybrid_rpc(dc, this), + graph(graph), threshold(threshold), +#ifdef _OPENMP + hybrid_edge_exchange(dc, omp_get_max_threads()), + hybrid_vertex_exchange(dc, omp_get_max_threads()) +#else + hybrid_edge_exchange(dc), + hybrid_vertex_exchange(dc) +#endif + { + /* fast pass for standalone case. 
*/ + standalone = hybrid_rpc.numprocs() == 1; + hybrid_rpc.barrier(); + } // end of constructor + + ~distributed_hybrid_ingress() { } + + /** Add an edge to the ingress object using random hashing assignment. + * This function acts as the first phase for SNAP graph to deliver edges + * via the hashing value of its target vertex. + */ + void add_edge(vertex_id_type source, vertex_id_type target, + const EdgeData& edata) { + const edge_buffer_record record(source, target, edata); + const procid_t owning_proc = standalone ? 0 : + graph_hash::hash_vertex(target) % hybrid_rpc.numprocs(); +#ifdef _OPENMP + hybrid_edge_exchange.send(owning_proc, record, omp_get_thread_num()); +#else + hybrid_edge_exchange.send(owning_proc, record); +#endif + } // end of add edge + + + /* add vdata */ + void add_vertex(vertex_id_type vid, const VertexData& vdata) { + const vertex_buffer_record record(vid, vdata); + const procid_t owning_proc = standalone ? 0 : + graph_hash::hash_vertex(vid) % hybrid_rpc.numprocs(); +#ifdef _OPENMP + hybrid_vertex_exchange.send(owning_proc, record, omp_get_thread_num()); +#else + hybrid_vertex_exchange.send(owning_proc, record); +#endif + } // end of add vertex + + + void finalize() { + + graphlab::timer ti; + + size_t nprocs = hybrid_rpc.numprocs(); + procid_t l_procid = hybrid_rpc.procid(); + size_t nedges = 0; + + hybrid_rpc.full_barrier(); + + if (l_procid == 0) { + memory_info::log_usage("start finalizing"); + logstream(LOG_EMPH) << "hybrid finalizing ..." + << " #vertices=" << graph.local_graph.num_vertices() + << " #edges=" << graph.local_graph.num_edges() + << " threshold=" << threshold + << std::endl; + } + + + /**************************************************************************/ + /* */ + /* Flush any additional data */ + /* */ + /**************************************************************************/ + hybrid_edge_exchange.flush(); hybrid_vertex_exchange.flush(); + + /** + * Fast pass for redundant finalization with no graph changes. 
+ */ + { + size_t changed_size = hybrid_edge_exchange.size() + hybrid_vertex_exchange.size(); + hybrid_rpc.all_reduce(changed_size); + if (changed_size == 0) { + logstream(LOG_INFO) << "Skipping Graph Finalization because no changes happened..." << std::endl; + return; + } + } + + + /**************************************************************************/ + /* */ + /* Prepare hybrid ingress */ + /* */ + /**************************************************************************/ + { + edge_buffer_type edge_buffer; + procid_t proc; + nedges = hybrid_edge_exchange.size(); + + hybrid_edges.reserve(nedges); + if (standalone) { /* fast pass for standalone */ + proc = -1; + while(hybrid_edge_exchange.recv(proc, edge_buffer)) + foreach(const edge_buffer_record& rec, edge_buffer) + hybrid_edges.push_back(rec); + hybrid_edge_exchange.clear(); + } else { + hopscotch_map in_degree_set; + + proc = -1; + while(hybrid_edge_exchange.recv(proc, edge_buffer)) { + foreach(const edge_buffer_record& rec, edge_buffer) { + hybrid_edges.push_back(rec); + in_degree_set[rec.target]++; + } + } + hybrid_edge_exchange.clear(); + hybrid_edge_exchange.barrier(); // barrier before reusing +#ifdef TUNING + if(l_procid == 0) { + logstream(LOG_INFO) << "save local edges and count in-degree: " + << ti.current_time() + << " secs" + << std::endl; + } +#endif + + // re-send edges of high-degree vertices + for (size_t i = 0; i < hybrid_edges.size(); i++) { + edge_buffer_record& rec = hybrid_edges[i]; + if (in_degree_set[rec.target] > threshold) { + const procid_t source_owner_proc = + graph_hash::hash_vertex(rec.source) % nprocs; + if(source_owner_proc != l_procid){ + // re-send the edge of high-degree vertices according to source + hybrid_edge_exchange.send(source_owner_proc, rec); + // set re-sent edges as empty for skipping + hybrid_edges[i] = edge_buffer_record(); + --nedges; + } + } + } +#ifdef TUNING + if(l_procid == 0) { + logstream(LOG_INFO) << "resend edges of high-degree vertices: " + << 
ti.current_time() + << " secs" + << std::endl; + } +#endif + + // receive edges of high-degree vertices + hybrid_edge_exchange.flush(); +#ifdef TUNING + logstream(LOG_INFO) << "receive high-degree edges: " + << hybrid_edge_exchange.size() << std::endl; +#endif + proc = -1; + while(hybrid_edge_exchange.recv(proc, edge_buffer)) { + foreach(const edge_buffer_record& rec, edge_buffer) { + hybrid_edges.push_back(rec); + ++nedges; + } + } + hybrid_edge_exchange.clear(); + in_degree_set.clear(); +#ifdef TUNING + if(l_procid == 0) { + logstream(LOG_INFO) << "receive high-degree edges: " + << ti.current_time() + << " secs" + << std::endl; + } +#endif + } + } + + if(l_procid == 0) { + memory_info::log_usage("prepare hybrid finalizing done."); + logstream(LOG_EMPH) << "prepare hybrid finalizing. (" + << ti.current_time() + << " secs)" + << std::endl; + } + + // connect to base finalize() + modified_base_finalize(nedges); + + // set vertex degree type for hybrid engine + set_degree_type(); + + if(l_procid == 0) { + memory_info::log_usage("hybrid finalizing graph done."); + logstream(LOG_EMPH) << "hybrid finalizing graph. 
(" + << ti.current_time() + << " secs)" + << std::endl; + } + } // end of finalize + + void set_degree_type() { + graphlab::timer ti; + procid_t l_procid = hybrid_rpc.procid(); + size_t high_master = 0, high_mirror = 0, low_master = 0, low_mirror = 0; + + for (size_t lvid = 0; lvid < graph.num_local_vertices(); lvid++) { + vertex_record& vrec = graph.lvid2record[lvid]; + if (vrec.num_in_edges > threshold) { + vrec.dtype = graph_type::HIGH; + if (vrec.owner == l_procid) high_master ++; + else high_mirror ++; + } else { + vrec.dtype = graph_type::LOW; + if (vrec.owner == l_procid) low_master ++; + else low_mirror ++; + } + } + +#ifdef TUNING + // Compute the total number of high-degree and low-degree vertices + std::vector swap_counts(hybrid_rpc.numprocs()); + + swap_counts[l_procid] = high_master; + hybrid_rpc.all_gather(swap_counts); + high_master = 0; + foreach(size_t count, swap_counts) high_master += count; + + swap_counts[l_procid] = high_mirror; + hybrid_rpc.all_gather(swap_counts); + high_mirror = 0; + foreach(size_t count, swap_counts) high_mirror += count; + + swap_counts[l_procid] = low_master; + hybrid_rpc.all_gather(swap_counts); + low_master = 0; + foreach(size_t count, swap_counts) low_master += count; + + swap_counts[l_procid] = low_mirror; + hybrid_rpc.all_gather(swap_counts); + low_mirror = 0; + foreach(size_t count, swap_counts) low_mirror += count; + + if(l_procid == 0) { + logstream(LOG_EMPH) << "hybrid info: master [" + << high_master << " " + << low_master << " " + << (float(high_master)/(high_master+low_master)) << "]" + << std::endl; + if ((high_mirror + low_mirror) > 0) + logstream(LOG_EMPH) << "hybrid info: mirror [" + << high_mirror << " " + << low_mirror << " " + << (float(high_mirror)/(high_mirror+low_mirror)) << "]" + << std::endl; + + memory_info::log_usage("set vertex type done."); + logstream(LOG_EMPH) << "set vertex type: " + << ti.current_time() + << " secs" + << std::endl; + } +#endif + } + + + /* + * do the same job as original 
base finalize except for + * extracting edges from hybrid_edges instead of original edge_buffer + */ + void modified_base_finalize(size_t nedges) { + graphlab::timer ti; + procid_t l_procid = hybrid_rpc.procid(); + size_t nprocs = hybrid_rpc.numprocs(); + + hybrid_rpc.full_barrier(); + + bool first_time_finalize = false; + /** + * Fast pass for first time finalization. + */ + if (graph.is_dynamic()) { + size_t nverts = graph.num_local_vertices(); + hybrid_rpc.all_reduce(nverts); + first_time_finalize = (nverts == 0); + } else { + first_time_finalize = false; + } + + + typedef typename hopscotch_map::value_type + vid2lvid_pair_type; + + /** + * \internal + * Buffer storage for new vertices to the local graph. + */ + typedef typename graph_type::hopscotch_map_type vid2lvid_map_type; + vid2lvid_map_type vid2lvid_buffer; + + /** + * \internal + * The beginning id assigned to the first new vertex. + */ + const lvid_type lvid_start = graph.vid2lvid.size(); + + /** + * \internal + * Bit field indicating the vertex that is updated during the ingress. 
+ */ + dense_bitset updated_lvids(graph.vid2lvid.size()); + + + /**************************************************************************/ + /* */ + /* Construct local graph */ + /* */ + /**************************************************************************/ + { // Add all the edges to the local graph + graph.local_graph.reserve_edge_space(nedges + 1); + + foreach(const edge_buffer_record& rec, hybrid_edges) { + // skip re-sent edges + if (rec.source == vertex_id_type(-1)) continue; + + // Get the source_vlid; + lvid_type source_lvid(-1); + if(graph.vid2lvid.find(rec.source) == graph.vid2lvid.end()) { + if (vid2lvid_buffer.find(rec.source) == vid2lvid_buffer.end()) { + source_lvid = lvid_start + vid2lvid_buffer.size(); + vid2lvid_buffer[rec.source] = source_lvid; + } else { + source_lvid = vid2lvid_buffer[rec.source]; + } + } else { + source_lvid = graph.vid2lvid[rec.source]; + updated_lvids.set_bit(source_lvid); + } + // Get the target_lvid; + lvid_type target_lvid(-1); + if(graph.vid2lvid.find(rec.target) == graph.vid2lvid.end()) { + if (vid2lvid_buffer.find(rec.target) == vid2lvid_buffer.end()) { + target_lvid = lvid_start + vid2lvid_buffer.size(); + vid2lvid_buffer[rec.target] = target_lvid; + } else { + target_lvid = vid2lvid_buffer[rec.target]; + } + } else { + target_lvid = graph.vid2lvid[rec.target]; + updated_lvids.set_bit(target_lvid); + } + graph.local_graph.add_edge(source_lvid, target_lvid, rec.edata); + } // end for loop over buffers + hybrid_edges.clear(); + + ASSERT_EQ(graph.vid2lvid.size() + vid2lvid_buffer.size(), + graph.local_graph.num_vertices()); +#ifdef TUNING + if(l_procid == 0) { + logstream(LOG_INFO) << "populating local graph: " + << ti.current_time() + << " secs" + << std::endl; + } +#endif + // Finalize local graph + graph.local_graph.finalize(); +#ifdef TUNING + logstream(LOG_INFO) << "local graph info: " << std::endl + << "\t nverts: " << graph.local_graph.num_vertices() + << std::endl + << "\t nedges: " << 
graph.local_graph.num_edges() + << std::endl; + + if(l_procid == 0) { + logstream(LOG_INFO) << "finalizing local graph: " + << ti.current_time() + << " secs" + << std::endl; + } +#endif + } + + + /**************************************************************************/ + /* */ + /* Receive and add vertex data to masters */ + /* */ + /**************************************************************************/ + // Setup the map containing all the vertices being negotiated by this machine + { + // receive any vertex data sent by other machines + if (hybrid_vertex_exchange.size() > 0) { + vertex_buffer_type vertex_buffer; procid_t sending_proc(-1); + while(hybrid_vertex_exchange.recv(sending_proc, vertex_buffer)) { + foreach(const vertex_buffer_record& rec, vertex_buffer) { + lvid_type lvid(-1); + if (graph.vid2lvid.find(rec.vid) == graph.vid2lvid.end()) { + if (vid2lvid_buffer.find(rec.vid) == vid2lvid_buffer.end()) { + lvid = lvid_start + vid2lvid_buffer.size(); + vid2lvid_buffer[rec.vid] = lvid; + } else { + lvid = vid2lvid_buffer[rec.vid]; + } + } else { + lvid = graph.vid2lvid[rec.vid]; + updated_lvids.set_bit(lvid); + } + if (distributed_hybrid_ingress::vertex_combine_strategy + && lvid < graph.num_local_vertices()) { + distributed_hybrid_ingress::vertex_combine_strategy( + graph.l_vertex(lvid).data(), rec.vdata); + } else { + graph.local_graph.add_vertex(lvid, rec.vdata); + } + } + } + hybrid_vertex_exchange.clear(); +#ifdef TUNING + logstream(LOG_INFO) << "base::#vert-msgs=" << hybrid_vertex_exchange.size() + << std::endl; + if(l_procid == 0) { + logstream(LOG_INFO) << "adding vertex data: " + << ti.current_time() + << " secs" + << std::endl; + } +#endif + } + } // end of loop to populate vrecmap + + + /**************************************************************************/ + /* */ + /* Assign vertex data and allocate vertex (meta)data space */ + /* */ + /**************************************************************************/ + { + // determine 
masters for all negotiated vertices + const size_t local_nverts = graph.vid2lvid.size() + vid2lvid_buffer.size(); + graph.lvid2record.reserve(local_nverts); + graph.lvid2record.resize(local_nverts); + graph.local_graph.resize(local_nverts); + foreach(const vid2lvid_pair_type& pair, vid2lvid_buffer) { + vertex_record& vrec = graph.lvid2record[pair.second]; + vrec.gvid = pair.first; + if (standalone) + vrec.owner = 0; + else + vrec.owner = graph_hash::hash_vertex(pair.first) % nprocs; + } + ASSERT_EQ(local_nverts, graph.local_graph.num_vertices()); + ASSERT_EQ(graph.lvid2record.size(), graph.local_graph.num_vertices()); +#ifdef TUNING + if(l_procid == 0) { + logstream(LOG_INFO) << "allocating lvid2record: " + << ti.current_time() + << " secs" + << std::endl; + } +#endif + } + + + /**************************************************************************/ + /* */ + /* Master handshake */ + /* */ + /**************************************************************************/ + if (!standalone) { +#ifdef _OPENMP + buffered_exchange vid_buffer(hybrid_rpc.dc(), omp_get_max_threads()); +#else + buffered_exchange vid_buffer(hybrid_rpc.dc()); +#endif + +#ifdef _OPENMP +#pragma omp parallel for +#endif + // send not owned vids to their master + for (lvid_type i = lvid_start; i < graph.lvid2record.size(); ++i) { + procid_t master = graph.lvid2record[i].owner; + if (master != l_procid) +#ifdef _OPENMP + vid_buffer.send(master, graph.lvid2record[i].gvid, omp_get_thread_num()); +#else + vid_buffer.send(master, graph.lvid2record[i].gvid); +#endif + } + vid_buffer.flush(); + hybrid_rpc.barrier(); + + // receive all vids owned by me + mutex flying_vids_lock; + boost::unordered_map flying_vids; +#ifdef _OPENMP +#pragma omp parallel +#endif + { + typename buffered_exchange::buffer_type buffer; + procid_t recvid = -1; + while(vid_buffer.recv(recvid, buffer)) { + foreach(const vertex_id_type vid, buffer) { + if (graph.vid2lvid.find(vid) == graph.vid2lvid.end()) { + if 
(vid2lvid_buffer.find(vid) == vid2lvid_buffer.end()) { + flying_vids_lock.lock(); + mirror_type& mirrors = flying_vids[vid]; + mirrors.set_bit(recvid); + flying_vids_lock.unlock(); + } else { + lvid_type lvid = vid2lvid_buffer[vid]; + graph.lvid2record[lvid]._mirrors.set_bit(recvid); + } + } else { + lvid_type lvid = graph.vid2lvid[vid]; + graph.lvid2record[lvid]._mirrors.set_bit(recvid); + updated_lvids.set_bit(lvid); + } + } + } + } + vid_buffer.clear(); + + if (!flying_vids.empty()) { + logstream(LOG_INFO) << "#flying-own-nverts=" + << flying_vids.size() + << std::endl; + + // reallocate spaces for the flying vertices. + size_t vsize_old = graph.lvid2record.size(); + size_t vsize_new = vsize_old + flying_vids.size(); + graph.lvid2record.resize(vsize_new); + graph.local_graph.resize(vsize_new); + for (typename boost::unordered_map::iterator it = flying_vids.begin(); + it != flying_vids.end(); ++it) { + lvid_type lvid = lvid_start + vid2lvid_buffer.size(); + vertex_record& vrec = graph.lvid2record[lvid]; + vertex_id_type gvid = it->first; + vrec.owner = l_procid; + vrec.gvid = gvid; + vrec._mirrors = it->second; + vid2lvid_buffer[gvid] = lvid; + } + } + } // end of master handshake + +#ifdef TUNING + if(l_procid == 0) { + logstream(LOG_INFO) << "master handshake: " + << ti.current_time() + << " secs" + << std::endl; + } +#endif + + + /**************************************************************************/ + /* */ + /* Merge in vid2lvid_buffer */ + /* */ + /**************************************************************************/ + { + if (graph.vid2lvid.size() == 0) { + graph.vid2lvid.swap(vid2lvid_buffer); + } else { + graph.vid2lvid.rehash(graph.vid2lvid.size() + vid2lvid_buffer.size()); + foreach (const typename vid2lvid_map_type::value_type& pair, vid2lvid_buffer) { + graph.vid2lvid.insert(pair); + } + vid2lvid_buffer.clear(); + } + } + + + /**************************************************************************/ + /* */ + /* Synchronize vertex data 
and meta information */ + /* */ + /**************************************************************************/ + // TODO: optimization for standalone + { + // construct the vertex set of changed vertices + + // Fast pass for first time finalize; + vertex_set changed_vset(true); + + // Compute the vertices that need synchronization + if (!first_time_finalize) { + vertex_set changed_vset = vertex_set(false); + changed_vset.make_explicit(graph); + + updated_lvids.resize(graph.num_local_vertices()); + for (lvid_type i = lvid_start; i < graph.num_local_vertices(); ++i) { + updated_lvids.set_bit(i); + } + changed_vset.localvset = updated_lvids; + buffered_exchange vset_exchange(hybrid_rpc.dc()); + // sync vset with all mirrors + changed_vset.synchronize_mirrors_to_master_or(graph, vset_exchange); + changed_vset.synchronize_master_to_mirrors(graph, vset_exchange); + } + + graphlab::graph_gather_apply + vrecord_sync_gas(graph, + boost::bind(&distributed_hybrid_ingress::finalize_gather, this, _1, _2), + boost::bind(&distributed_hybrid_ingress::finalize_apply, this, _1, _2, _3)); + vrecord_sync_gas.exec(changed_vset); + +#ifdef TUNING + if(l_procid == 0) { + logstream(LOG_INFO) << "synchronizing vertex (meta)data: " + << ti.current_time() + << " secs" + << std::endl; + } +#endif + } + + base_type::exchange_global_info(standalone); +#ifdef TUNING + if(l_procid == 0) { + logstream(LOG_INFO) << "exchange global info: " + << ti.current_time() + << " secs" + << std::endl; + } +#endif + + if(l_procid == 0) { + memory_info::log_usage("base finalizing done."); + logstream(LOG_EMPH) << "base finalizing. (" + << ti.current_time() + << " secs)" + << std::endl; + } + } // end of modified base finalize + + private: + boost::function vertex_combine_strategy; + + /** + * \brief Gather the vertex distributed meta data. 
+ */ + vertex_negotiator_record finalize_gather(lvid_type& lvid, graph_type& graph) { + vertex_negotiator_record accum; + accum.num_in_edges = graph.local_graph.num_in_edges(lvid); + accum.num_out_edges = graph.local_graph.num_out_edges(lvid); + if (graph.l_is_master(lvid)) { + accum.has_data = true; + accum.vdata = graph.l_vertex(lvid).data(); + accum.mirrors = graph.lvid2record[lvid]._mirrors; + } + return accum; + } + + /** + * \brief Update the vertex data structures with the gathered vertex metadata. + */ + void finalize_apply(lvid_type lvid, const vertex_negotiator_record& accum, graph_type& graph) { + typename graph_type::vertex_record& vrec = graph.lvid2record[lvid]; + vrec.num_in_edges = accum.num_in_edges; + vrec.num_out_edges = accum.num_out_edges; + graph.l_vertex(lvid).data() = accum.vdata; + vrec._mirrors = accum.mirrors; + } + }; // end of distributed_hybrid_ingress +}; // end of namespace graphlab +#include + + +#endif diff --git a/src/graphlab/graph/ingress/distributed_ingress_base.hpp b/src/graphlab/graph/ingress/distributed_ingress_base.hpp index db682c71eb..9346e919d1 100644 --- a/src/graphlab/graph/ingress/distributed_ingress_base.hpp +++ b/src/graphlab/graph/ingress/distributed_ingress_base.hpp @@ -257,10 +257,10 @@ namespace graphlab { /**************************************************************************/ { // Add all the edges to the local graph logstream(LOG_INFO) << "Graph Finalize: constructing local graph" << std::endl; - const size_t nedges = edge_exchange.size()+1; + const size_t nedges = edge_exchange.size() + 1; graph.local_graph.reserve_edge_space(nedges + 1); edge_buffer_type edge_buffer; - procid_t proc; + procid_t proc(-1); while(edge_exchange.recv(proc, edge_buffer)) { foreach(const edge_buffer_record& rec, edge_buffer) { // Get the source_vlid; @@ -505,13 +505,13 @@ namespace graphlab { memory_info::log_usage("Finished synchronizing vertex (meta)data"); } - exchange_global_info(); + exchange_global_info(false); } // end 
of finalize /* Exchange graph statistics among all nodes and compute * global statistics for the distributed graph. */ - void exchange_global_info () { + void exchange_global_info (bool standalone) { // Count the number of vertices owned locally graph.local_own_nverts = 0; foreach(const vertex_record& record, graph.lvid2record) @@ -521,33 +521,58 @@ namespace graphlab { logstream(LOG_INFO) << "Graph Finalize: exchange global statistics " << std::endl; - // Compute edge counts - std::vector swap_counts(rpc.numprocs()); - swap_counts[rpc.procid()] = graph.num_local_edges(); - rpc.all_gather(swap_counts); - graph.nedges = 0; - foreach(size_t count, swap_counts) graph.nedges += count; + if (standalone) { + graph.nedges = graph.num_local_edges(); + graph.nverts = graph.num_local_own_vertices(); + graph.nreplicas = graph.num_local_vertices(); + } else { + // Compute edge counts + std::vector swap_counts(rpc.numprocs()); + swap_counts[rpc.procid()] = graph.num_local_edges(); + rpc.all_gather(swap_counts); + graph.nedges = 0; + foreach(size_t count, swap_counts) graph.nedges += count; + if (rpc.procid() == 0) { + size_t max = *std::max_element(swap_counts.begin(), swap_counts.end()); + logstream(LOG_EMPH) << "edges balance: " + << (double) max / ((double) graph.nedges / rpc.numprocs()) + << std::endl; + } - // compute vertex count - swap_counts[rpc.procid()] = graph.num_local_own_vertices(); - rpc.all_gather(swap_counts); - graph.nverts = 0; - foreach(size_t count, swap_counts) graph.nverts += count; + // compute vertex count + swap_counts[rpc.procid()] = graph.num_local_own_vertices(); + rpc.all_gather(swap_counts); + graph.nverts = 0; + foreach(size_t count, swap_counts) graph.nverts += count; + if (rpc.procid() == 0) { + size_t max = *std::max_element(swap_counts.begin(), swap_counts.end()); + logstream(LOG_EMPH) << "own vertices balance: " + << (double) max / ((double) graph.nverts / rpc.numprocs()) + << std::endl; + } - // compute replicas - swap_counts[rpc.procid()] 
= graph.num_local_vertices(); - rpc.all_gather(swap_counts); - graph.nreplicas = 0; - foreach(size_t count, swap_counts) graph.nreplicas += count; + // compute replicas + swap_counts[rpc.procid()] = graph.num_local_vertices(); + rpc.all_gather(swap_counts); + graph.nreplicas = 0; + foreach(size_t count, swap_counts) graph.nreplicas += count; + if (rpc.procid() == 0) { + size_t max = *std::max_element(swap_counts.begin(), swap_counts.end()); + logstream(LOG_EMPH) << "local vertices balance: " + << (double) max / ((double) graph.nreplicas / rpc.numprocs()) + << std::endl; + } + } if (rpc.procid() == 0) { logstream(LOG_EMPH) << "Graph info: " << "\n\t nverts: " << graph.num_vertices() << "\n\t nedges: " << graph.num_edges() << "\n\t nreplicas: " << graph.nreplicas - << "\n\t replication factor: " << (double)graph.nreplicas/graph.num_vertices() + << "\n\t replication factor: " + << (double)graph.nreplicas/graph.num_vertices() << std::endl; } } diff --git a/src/graphlab/graph/ingress/ingress_edge_decision.hpp b/src/graphlab/graph/ingress/ingress_edge_decision.hpp index ef5e4d3c9c..d2d0a92840 100644 --- a/src/graphlab/graph/ingress/ingress_edge_decision.hpp +++ b/src/graphlab/graph/ingress/ingress_edge_decision.hpp @@ -28,6 +28,7 @@ #include #include #include +#include namespace graphlab { template @@ -39,7 +40,7 @@ namespace graphlab { public: typedef graphlab::vertex_id_type vertex_id_type; typedef distributed_graph graph_type; - typedef fixed_dense_bitset bin_counts_type; + typedef fixed_dense_bitset bin_counts_type; public: /** \brief A decision object for computing the edge assingment. 
*/ diff --git a/src/graphlab/rpc/fiber_async_consensus.hpp b/src/graphlab/rpc/fiber_async_consensus.hpp index 9b4c3de4a1..9af2fda1b3 100644 --- a/src/graphlab/rpc/fiber_async_consensus.hpp +++ b/src/graphlab/rpc/fiber_async_consensus.hpp @@ -80,7 +80,7 @@ namespace graphlab { * \endcode * * Additionally, incoming RPC calls which create work must ensure there are - * active fiberswhich are capable of processing the work. An easy solution + * active fibers which are capable of processing the work. An easy solution * will be to simply cancel_one(). Other more optimized solutions * include keeping a counter of the number of active fibers, and only calling * cancel() or cancel_one() if all fibers are asleep. (Note that the optimized diff --git a/src/graphlab/scheduler/fifo_scheduler.cpp b/src/graphlab/scheduler/fifo_scheduler.cpp index 1896258924..30b64f2854 100644 --- a/src/graphlab/scheduler/fifo_scheduler.cpp +++ b/src/graphlab/scheduler/fifo_scheduler.cpp @@ -53,6 +53,10 @@ fifo_scheduler::fifo_scheduler(size_t num_vertices, ASSERT_GE(opts.get_ncpus(), 1); set_options(opts); initialize_data_structures(); + logstream(LOG_INFO) << "FIFO Scheduler:" + << " multi=" << multi + << std::endl; + } diff --git a/src/graphlab/scheduler/priority_scheduler.cpp b/src/graphlab/scheduler/priority_scheduler.cpp index b0e0ec532e..29294e2f57 100644 --- a/src/graphlab/scheduler/priority_scheduler.cpp +++ b/src/graphlab/scheduler/priority_scheduler.cpp @@ -56,6 +56,10 @@ priority_scheduler::priority_scheduler(size_t num_vertices, ASSERT_GE(opts.get_ncpus(), 1); set_options(opts); initialize_data_structures(); + logstream(LOG_INFO) << "Priority Scheduler:" + << " min_priority=" << min_priority + << " multi=" << multi + << std::endl; } diff --git a/src/graphlab/scheduler/queued_fifo_scheduler.cpp b/src/graphlab/scheduler/queued_fifo_scheduler.cpp index b3a5a42129..bf8f1949be 100644 --- a/src/graphlab/scheduler/queued_fifo_scheduler.cpp +++ 
b/src/graphlab/scheduler/queued_fifo_scheduler.cpp @@ -58,6 +58,11 @@ queued_fifo_scheduler::queued_fifo_scheduler(size_t num_vertices, ASSERT_GE(opts.get_ncpus(), 1); set_options(opts); initialize_data_structures(); + + logstream(LOG_INFO) << "Queued-FIFO Scheduler:" + << " queuesize=" << sub_queue_size + << " multi=" << multi + << std::endl; } void queued_fifo_scheduler::set_num_vertices(const lvid_type numv) { diff --git a/src/graphlab/scheduler/sweep_scheduler.cpp b/src/graphlab/scheduler/sweep_scheduler.cpp index 33ba7a258b..f8e2651fb0 100644 --- a/src/graphlab/scheduler/sweep_scheduler.cpp +++ b/src/graphlab/scheduler/sweep_scheduler.cpp @@ -82,6 +82,13 @@ sweep_scheduler::sweep_scheduler(size_t num_vertices, for(size_t i = 0; i < cpu2index.size(); ++i) cpu2index[i] = i; } vertex_is_scheduled.resize(num_vertices); + + logstream(LOG_INFO) << "Sweep Scheduler:" + << " order=" << ordering + << " strict=" << strict_round_robin + << " max_iterations=" << max_iterations + << std::endl; + } // end of constructor diff --git a/src/graphlab/util/tetrad.hpp b/src/graphlab/util/tetrad.hpp new file mode 100755 index 0000000000..6decf7f7c0 --- /dev/null +++ b/src/graphlab/util/tetrad.hpp @@ -0,0 +1,122 @@ +/** + * Copyright (c) 2013 Shanghai Jiao Tong University. + * All rights reserved. + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an "AS + * IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either + * express or implied. See the License for the specific language + * governing permissions and limitations under the License. 
+ * + * author: rong chen (rongchen@sjtu.edu.cn) 2013.7 + * + */ + +#ifndef GRAPHLAB_TETRAD_HPP +#define GRAPHLAB_TETRAD_HPP + + +#include + +#include +#include + +namespace graphlab { + + template + struct tetrad { + typedef _T1 first_type; + typedef _T2 second_type; + typedef _T3 third_type; + typedef _T4 fourth_type; + + first_type first; + second_type second; + third_type third; + fourth_type fourth; + + tetrad() : first(_T1()), second(_T2()), third(_T3()), fourth(_T4()) {} + tetrad(const _T1& x, const _T2& y, const _T3& z, const _T4& w) : + first(x), second(y), third(z), fourth(w) {} + + tetrad(const tetrad<_T1, _T2, _T3, _T4>& o) : + first(o.first), second(o.second), third(o.third), fourth(o.fourth){} + + void load(iarchive& iarc) { + iarc >> first; + iarc >> second; + iarc >> third; + iarc >> fourth; + } + + void save(oarchive& oarc) const { + oarc << first; + oarc << second; + oarc << third; + oarc << fourth; + } + }; + + template + inline bool operator == (const tetrad<_T1, _T2, _T3, _T4>& x, + const tetrad<_T1, _T2, _T3, _T4>& y) { + return x.first == y.first && x.second == y.second + && x.third == y.third && x.fourth == y.fourth; + } + + template + inline bool operator < (const tetrad<_T1, _T2, _T3, _T4>& l, + const tetrad<_T1, _T2, _T3, _T4>& r) { + return (l.first < r.first) || + (!(r.first < l.first) + && (l.second < r.second)) || + (!(r.first < l.first) + && !(r.second < l.second) + && (l.third < r.third)) || + (!(r.first < l.first) + && !(r.second < l.second) + && !(r.third < l.third) + && (l.fourth< r.fourth)); + + } + + template + inline bool operator != (const tetrad<_T1, _T2, _T3, _T4>& l, + const tetrad<_T1, _T2, _T3, _T4>& r) { + return !(l == r); + } + + template + inline bool operator > (const tetrad<_T1, _T2, _T3, _T4>& l, + const tetrad<_T1, _T2, _T3, _T4>& r) { + return r < l; + } + + template + inline bool operator <= (const tetrad<_T1, _T2, _T3, _T4>& l, + const tetrad<_T1, _T2, _T3, _T4>& r) { + return !(r < l); + } + + template + 
inline bool operator >= (const tetrad<_T1, _T2, _T3, _T4>& l, + const tetrad<_T1, _T2, _T3, _T4>& r) { + return !(l < r); + } + + template + inline tetrad<_T1, _T2, _T3, _T4> make_tetrad( + const _T1& x, const _T2& y, const _T3& z, const _T4& w) { + return tetrad<_T1, _T2, _T3, _T4>(x, y, z, w); + } +}; // end of graphlab namespace + +#endif + + diff --git a/src/graphlab/util/triple.hpp b/src/graphlab/util/triple.hpp new file mode 100755 index 0000000000..e1f324a37c --- /dev/null +++ b/src/graphlab/util/triple.hpp @@ -0,0 +1,112 @@ +/** + * Copyright (c) 2013 Shanghai Jiao Tong University. + * All rights reserved. + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an "AS + * IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either + * express or implied. See the License for the specific language + * governing permissions and limitations under the License. 
+ * + * author: rong chen (rongchen@sjtu.edu.cn) 2013.7 + * + */ + +#ifndef GRAPHLAB_TRIPLE_HPP +#define GRAPHLAB_TRIPLE_HPP + + +#include + +#include +#include + +namespace graphlab { + + template + struct triple { + typedef _T1 first_type; + typedef _T2 second_type; + typedef _T3 third_type; + + first_type first; + second_type second; + third_type third; + + triple() : first(_T1()), second(_T2()), third(_T3()) {} + triple(const _T1& x, const _T2& y, const _T3& z) : + first(x), second(y), third(z) {} + + triple(const triple<_T1, _T2, _T3>& o) : + first(o.first), second(o.second), third(o.third){} + + void load(iarchive& iarc) { + iarc >> first; + iarc >> second; + iarc >> third; + } + + void save(oarchive& oarc) const { + oarc << first; + oarc << second; + oarc << third; + } + }; + + template + inline bool operator == (const triple<_T1, _T2, _T3>& x, + const triple<_T1, _T2, _T3>& y) { + return x.first == y.first && x.second == y.second && x.third == y.third; + } + + template + inline bool operator < (const triple<_T1, _T2, _T3>& l, + const triple<_T1, _T2, _T3>& r) { + return (l.first < r.first) || + (!(r.first < l.first) + && (l.second < r.second)) || + (!(r.first < l.first) + && !(r.second < l.second) + && (l.third < r.third)); + + } + + template + inline bool operator != (const triple<_T1, _T2, _T3>& l, + const triple<_T1, _T2, _T3>& r) { + return !(l == r); + } + + template + inline bool operator > (const triple<_T1, _T2, _T3>& l, + const triple<_T1, _T2, _T3>& r) { + return r < l; + } + + template + inline bool operator <= (const triple<_T1, _T2, _T3>& l, + const triple<_T1, _T2, _T3>& r) { + return !(r < l); + } + + template + inline bool operator >= (const triple<_T1, _T2, _T3>& l, + const triple<_T1, _T2, _T3>& r) { + return !(l < r); + } + + template + inline triple<_T1, _T2, _T3> make_triple( + const _T1& x, const _T2& y, const _T3& z) { + return triple<_T1, _T2, _T3>(x, y, z); + } +}; // end of graphlab namespace + +#endif + diff --git 
a/src/graphlab/vertex_program/context.hpp b/src/graphlab/vertex_program/context.hpp index 2cddb52cb9..24f3b6f247 100644 --- a/src/graphlab/vertex_program/context.hpp +++ b/src/graphlab/vertex_program/context.hpp @@ -119,10 +119,21 @@ namespace graphlab { * Send a message to a vertex. */ void signal(const vertex_type& vertex, - const message_type& message = message_type()) { + const message_type& message) { engine.internal_signal(vertex, message); } + /** + * Signal a vertex without a message. + * + * This new interface can avoid contention on a vertex with a large number of in-edges + * for applications that scatter to neighbors but do not send messages. + * For example: PageRank with dynamic computation. + */ + void signal(const vertex_type& vertex) { + engine.internal_signal(vertex); + } + /** * Send a message to an arbitrary vertex ID. * \warning If sending to neighboring vertices, the \ref signal() diff --git a/src/graphlab/vertex_program/icontext.hpp b/src/graphlab/vertex_program/icontext.hpp index 4038f2b30b..d6d315bace 100644 --- a/src/graphlab/vertex_program/icontext.hpp +++ b/src/graphlab/vertex_program/icontext.hpp @@ -211,7 +211,9 @@ namespace graphlab { * \param message [in] The message to send, defaults to message_type(). */ virtual void signal(const vertex_type& vertex, - const message_type& message = message_type()) { } + const message_type& message) { } + + virtual void signal(const vertex_type& vertex) { } /** * \brief Send a message to a vertex ID. 
diff --git a/tests/simulate_powerlaw_replica.cpp b/tests/simulate_powerlaw_replica.cpp new file mode 100644 index 0000000000..c061c3a1a3 --- /dev/null +++ b/tests/simulate_powerlaw_replica.cpp @@ -0,0 +1,89 @@ +#include +#include +using namespace std; + +double h(int num,double alpha){ + double sum=0; + for(int i=1;i<=num;i++){ + sum+=pow(i,-alpha); + } + return sum; +} +double random_replication(int V,double alpha,int p){ + double E=h(V,alpha-1)/h(V,alpha)*V; + double tmp=V/h(V,alpha); + double sum=0; + for(int i=1;i<=V;i++){ + double num_replica=p*(1-pow((1-1.0/p),i+E/V)); + num_replica+=(p-num_replica)/p; + sum+=tmp * pow(i,-alpha)* num_replica; + } + return sum/V; +} + +double grid_replication(int V,double alpha,int p){ + double E=h(V,alpha-1)/h(V,alpha)*V; + double fp=2*pow(p,0.5)-1; + double tmp=V/h(V,alpha); + double sum=0; + for(int i=1;i<=V;i++){ + double num_replica=fp*(1-pow((1-1.0/fp),i+E/V)); + num_replica+=(fp-num_replica)/fp; + sum+=tmp * pow(i,-alpha)* num_replica; + } + return sum/V; +} +double hybrid_replication(int V,double alpha,int p,int threshold){ + double E=h(V,alpha-1)/h(V,alpha)*V; + double tmp=V/h(V,alpha); + double sum=0; + double R_E_H=1-h(threshold,alpha-1)/h(V,alpha-1); + for(int i=1;i<=V;i++){ + double num_replica; + if(i<=threshold){ + num_replica=p*(1-pow((1-1.0/p),E/V*(1-R_E_H))); + } else { + num_replica=p*(1-pow((1-1.0/p),i+E/V*(1-R_E_H))); + } + num_replica+=(p-num_replica)/p; + sum+=tmp * pow(i,-alpha)* num_replica; + } + return sum/V; +} + +int main(){ + int V=10000000;//10m + cout<<"alpha\trandom\tgrid\thybrid\tp=48"<(implicitratingtype, graph, dc); dc.cout() << "Finalizing graph." << std::endl; timer.start(); - graph.finalize(); + graph.finalize(); + const double finalizing = timer.current_time(); dc.cout() << "Finalizing graph. 
Finished in " - << timer.current_time() << std::endl; + << finalizing << std::endl; + + // NOTE: ingress time = loading time + finalizing time + const double ingress = loading + finalizing; + dc.cout() << "Final Ingress (second): " << ingress << std::endl; if (!graph.num_edges() || !graph.num_vertices()) - logstream(LOG_FATAL)<< "Failed to load graph. Check your input path: " << input_dir << std::endl; + logstream(LOG_FATAL) << "Failed to load graph. Check your input path: " + << input_dir << std::endl; + dc.cout() << "#vertices: " << graph.num_vertices() + << " #edges:" << graph.num_edges() << std::endl; dc.cout() @@ -686,7 +695,7 @@ int main(int argc, char** argv) { // Compute the final training error ----------------------------------------- dc.cout() << "Final error: " << std::endl; engine.aggregate_now("error"); - + // Make predictions --------------------------------------------------------- if(!predictions.empty()) { std::cout << "Saving predictions" << std::endl; @@ -699,13 +708,11 @@ int main(int argc, char** argv) { true, threads_per_machine); //save the linear model graph.save(predictions + ".U", linear_model_saver_U(), - gzip_output, true, false, threads_per_machine); + gzip_output, true, false, threads_per_machine); graph.save(predictions + ".V", linear_model_saver_V(), - gzip_output, true, false, threads_per_machine); + gzip_output, true, false, threads_per_machine); } - - graphlab::mpi_tools::finalize(); return EXIT_SUCCESS; diff --git a/toolkits/collaborative_filtering/svd.cpp b/toolkits/collaborative_filtering/svd.cpp index eb5483329a..19aec7ede0 100644 --- a/toolkits/collaborative_filtering/svd.cpp +++ b/toolkits/collaborative_filtering/svd.cpp @@ -65,6 +65,7 @@ bool binary = false; //if true, all edges = 1 mat a,PT; bool v_vector = false; int input_file_offset = 0; //if set to non zero, each row/col id will be reduced the input_file_offset +std::string exec_type = "synchronous"; vec singular_values; DECLARE_TRACER(svd_bidiagonal); @@ -602,7 
+603,7 @@ void write_output_vector(const std::string datafile, const vec & output, bool is int main(int argc, char** argv) { global_logger().set_log_to_console(true); - + global_logger().set_log_level(LOG_INFO); INITIALIZE_TRACER(svd_bidiagonal, "svd bidiagonal"); INITIALIZE_TRACER(svd_error_estimate, "svd error estimate"); INITIALIZE_TRACER(svd_swork, "Svd swork"); @@ -615,7 +616,6 @@ int main(int argc, char** argv) { "Compute the gklanczos factorization of a matrix."; graphlab::command_line_options clopts(description); std::string input_dir, output_dir; - std::string exec_type = "synchronous"; clopts.attach_option("matrix", input_dir, "The directory containing the matrix file"); clopts.add_positional("matrix"); @@ -635,6 +635,7 @@ int main(int argc, char** argv) { clopts.attach_option("predictions", predictions, "predictions file prefix"); clopts.attach_option("binary", binary, "If true, all edges are weighted as one"); clopts.attach_option("input_file_offset", input_file_offset, "input file node id offset (default 0)"); + clopts.attach_option("engine", exec_type, "specify engine type"); if(!clopts.parse(argc, argv) || input_dir == "") { std::cout << "Error in parsing command line arguments." << std::endl; clopts.print_description(); @@ -701,16 +702,28 @@ int main(int argc, char** argv) { graph_type graph(dc, clopts); graph.load(input_dir, graph_loader); pgraph = &graph; + const double loading = timer.current_time(); dc.cout() << "Loading graph. Finished in " - << timer.current_time() << std::endl; + << loading << std::endl; + + dc.cout() << "Finalizing graph." << std::endl; timer.start(); graph.finalize(); + const double finalizing = timer.current_time(); dc.cout() << "Finalizing graph. 
Finished in " - << timer.current_time() << std::endl; + << finalizing << std::endl; + + // NOTE: ingress time = loading time + finalizing time + const double ingress = loading + finalizing; + dc.cout() << "Final Ingress (second): " << ingress << std::endl; + if (!graph.num_edges() || !graph.num_vertices()) - logstream(LOG_FATAL)<< "Failed to load graph. Check your input path: " << input_dir << std::endl; + logstream(LOG_FATAL) << "Failed to load graph. Check your input path: " + << input_dir << std::endl; + dc.cout() << "#vertices: " << graph.num_vertices() + << " #edges:" << graph.num_edges() << std::endl; dc.cout() << "========== Graph statistics on proc " << dc.procid() @@ -752,7 +765,8 @@ int main(int argc, char** argv) { for (int i=0; i< rows; i++){ int rc = fscanf(file, "%lg\n", &val); if (rc != 1) - logstream(LOG_FATAL)<<"Failed to read initial vector (on line: "<< i << " ) " << std::endl; + logstream(LOG_FATAL)<<"Failed to read initial vector (on line: "<< i + << " ) " << std::endl; input[i] = val; } fclose(file); @@ -764,16 +778,17 @@ int main(int argc, char** argv) { lanczos( info, timer, errest, vecfile); if (graphlab::mpi_tools::rank()==0) - write_output_vector(predictions + ".singular_values", singular_values, false, "%GraphLab SVD Solver library. This file contains the singular values."); + write_output_vector(predictions + ".singular_values", singular_values, false, + "%GraphLab SVD Solver library. 
This file contains the singular values."); const double runtime = timer.current_time(); dc.cout() << "----------------------------------------------------------" - << std::endl - << "Final Runtime (seconds): " << runtime - << std::endl - << "Updates executed: " << engine.num_updates() << std::endl - << "Update Rate (updates/second): " - << engine.num_updates() / runtime << std::endl; + << std::endl + << "Final Runtime (seconds): " << runtime + << std::endl + << "Updates executed: " << engine.num_updates() << std::endl + << "Update Rate (updates/second): " + << engine.num_updates() / runtime << std::endl; // Compute the final training error ----------------------------------------- if (unittest == 1){ @@ -790,7 +805,7 @@ int main(int argc, char** argv) { assert(pow(singular_values[0]- 2.16097, 2) < 1e-8); assert(pow(singular_values[2]- 0.554159, 2) < 1e-8); } - + graphlab::mpi_tools::finalize(); diff --git a/toolkits/graph_analytics/approximate_diameter.cpp b/toolkits/graph_analytics/approximate_diameter.cpp index e15c8b8282..c5ec778488 100644 --- a/toolkits/graph_analytics/approximate_diameter.cpp +++ b/toolkits/graph_analytics/approximate_diameter.cpp @@ -307,14 +307,31 @@ int main(int argc, char** argv) { } //load graph - graph_type graph(dc, clopts); dc.cout() << "Loading graph in format: "<< format << std::endl; + graphlab::timer timer; + graph_type graph(dc, clopts); graph.load_format(graph_dir, format); + const double loading = timer.current_time(); + dc.cout() << "Loading graph. Finished in " + << loading << std::endl; + + // must call finalize before querying the graph + dc.cout() << "Finalizing graph." << std::endl; + timer.start(); graph.finalize(); + const double finalizing = timer.current_time(); + dc.cout() << "Finalizing graph. 
Finished in " + << finalizing << std::endl; + + // NOTE: ingress time = loading time + finalizing time + const double ingress = loading + finalizing; + dc.cout() << "Final Ingress (second): " << ingress << std::endl; + + + dc.cout() << "#vertices: " << graph.num_vertices() + << " #edges:" << graph.num_edges() << std::endl; - time_t start, end; //initialize vertices - time(&start); if (use_sketch == false) graph.transform_vertices(initialize_vertex); else @@ -322,6 +339,7 @@ int main(int argc, char** argv) { graphlab::omni_engine engine(dc, graph, exec_type, clopts); + timer.start(); //main iteration size_t previous_count = 0; size_t diameter = 0; @@ -348,13 +366,19 @@ int main(int argc, char** argv) { } previous_count = current_count; } - time(&end); - dc.cout() << "graph calculation time is " << (end - start) << " sec\n"; - dc.cout() << "The approximate diameter is " << diameter << "\n"; + const double runtime = timer.current_time(); + dc.cout() << "----------------------------------------------------------" + << std::endl + << "Final Runtime (seconds): " << runtime + << std::endl + << "Updates executed: " << engine.num_updates() << std::endl + << "Update Rate (updates/second): " + << engine.num_updates() / runtime << std::endl; - graphlab::mpi_tools::finalize(); + dc.cout() << "The approximate diameter is " << diameter << std::endl; + graphlab::mpi_tools::finalize(); return EXIT_SUCCESS; } diff --git a/toolkits/graph_analytics/connected_component.cpp b/toolkits/graph_analytics/connected_component.cpp index 6f7739c8fc..7bdcd14e68 100644 --- a/toolkits/graph_analytics/connected_component.cpp +++ b/toolkits/graph_analytics/connected_component.cpp @@ -175,24 +175,51 @@ int main(int argc, char** argv) { std::cout << "--graph is not optional\n"; return EXIT_FAILURE; } - - graph_type graph(dc, clopts); - + //load graph dc.cout() << "Loading graph in format: "<< format << std::endl; + graphlab::timer timer; + graph_type graph(dc, clopts); graph.load_format(graph_dir, 
format); - graphlab::timer ti; + const double loading = timer.current_time(); + dc.cout() << "Loading graph. Finished in " + << loading << std::endl; + + + dc.cout() << "Finalizing graph." << std::endl; + timer.start(); graph.finalize(); - dc.cout() << "Finalization in " << ti.current_time() << std::endl; + const double finalizing = timer.current_time(); + dc.cout() << "Finalizing graph. Finished in " + << finalizing << std::endl; + + // NOTE: ingress time = loading time + finalizing time + const double ingress = loading + finalizing; + dc.cout() << "Final Ingress (second): " << ingress << std::endl; + + + dc.cout() << "#vertices: " << graph.num_vertices() + << " #edges:" << graph.num_edges() << std::endl; + + // init graph.transform_vertices(initialize_vertex); //running the engine - time_t start, end; graphlab::omni_engine engine(dc, graph, exec_type, clopts); engine.signal_all(); - time(&start); + timer.start(); engine.start(); + const double runtime = timer.current_time(); + dc.cout() << "----------------------------------------------------------" + << std::endl + << "Final Runtime (seconds): " << runtime + << std::endl + << "Updates executed: " << engine.num_updates() << std::endl + << "Update Rate (updates/second): " + << engine.num_updates() / runtime << std::endl; + + //write results if (saveprefix.size() > 0) { graph.save(saveprefix, graph_writer(), diff --git a/toolkits/graph_analytics/pagerank.cpp b/toolkits/graph_analytics/pagerank.cpp index 68f4efd2a6..b11e603a94 100644 --- a/toolkits/graph_analytics/pagerank.cpp +++ b/toolkits/graph_analytics/pagerank.cpp @@ -217,12 +217,14 @@ int main(int argc, char** argv) { // make sure this is the synchronous engine dc.cout() << "--iterations set. Forcing Synchronous engine, and running " << "for " << ITERATIONS << " iterations." 
<< std::endl; - clopts.get_engine_args().set_option("type", "synchronous"); + //clopts.get_engine_args().set_option("type", "synchronous"); clopts.get_engine_args().set_option("max_iterations", ITERATIONS); clopts.get_engine_args().set_option("sched_allv", true); } // Build the graph ---------------------------------------------------------- + dc.cout() << "Loading graph." << std::endl; + graphlab::timer timer; graph_type graph(dc, clopts); if(powerlaw > 0) { // make a synthetic graph dc.cout() << "Loading synthetic Powerlaw graph." << std::endl; @@ -237,8 +239,23 @@ int main(int argc, char** argv) { clopts.print_description(); return 0; } + const double loading = timer.current_time(); + dc.cout() << "Loading graph. Finished in " + << loading << std::endl; + + // must call finalize before querying the graph + dc.cout() << "Finalizing graph." << std::endl; + timer.start(); graph.finalize(); + const double finalizing = timer.current_time(); + dc.cout() << "Finalizing graph. Finished in " + << finalizing << std::endl; + + // NOTE: ingress time = loading time + finalizing time + const double ingress = loading + finalizing; + dc.cout() << "Final Ingress (second): " << ingress << std::endl; + dc.cout() << "#vertices: " << graph.num_vertices() << " #edges:" << graph.num_edges() << std::endl; @@ -248,11 +265,16 @@ int main(int argc, char** argv) { // Running The Engine ------------------------------------------------------- graphlab::omni_engine engine(dc, graph, exec_type, clopts); engine.signal_all(); + timer.start(); engine.start(); - const double runtime = engine.elapsed_seconds(); - dc.cout() << "Finished Running engine in " << runtime - << " seconds." 
<< std::endl; - + const double runtime = timer.current_time(); + dc.cout() << "----------------------------------------------------------" + << std::endl + << "Final Runtime (seconds): " << runtime + << std::endl + << "Updates executed: " << engine.num_updates() << std::endl + << "Update Rate (updates/second): " + << engine.num_updates() / runtime << std::endl; const double total_rank = graph.map_reduce_vertices(map_rank); std::cout << "Total rank: " << total_rank << std::endl; @@ -265,9 +287,6 @@ int main(int argc, char** argv) { false); // do not save edges } - double totalpr = graph.map_reduce_vertices(pagerank_sum); - std::cout << "Totalpr = " << totalpr << "\n"; - // Tear-down communication layer and quit ----------------------------------- graphlab::mpi_tools::finalize(); return EXIT_SUCCESS; diff --git a/toolkits/graph_analytics/simple_coloring.cpp b/toolkits/graph_analytics/simple_coloring.cpp index 70efb3cd59..3dde327a1d 100644 --- a/toolkits/graph_analytics/simple_coloring.cpp +++ b/toolkits/graph_analytics/simple_coloring.cpp @@ -178,31 +178,32 @@ size_t validate_conflict(graph_type::edge_type& edge) { int main(int argc, char** argv) { - - //global_logger().set_log_level(LOG_INFO); - // Initialize control plane using mpi graphlab::mpi_tools::init(argc, argv); graphlab::distributed_control dc; - + global_logger().set_log_level(LOG_INFO); dc.cout() << "This program computes a simple graph coloring of a" "provided graph.\n\n"; + // Parse command line options ----------------------------------------------- graphlab::command_line_options clopts("Graph coloring. " "Given a graph, this program computes a graph coloring of the graph." "The Asynchronous engine is used."); std::string prefix, format; std::string output; float alpha = 2.1; - size_t powerlaw = 0; + std::string exec_type = "asynchronous"; clopts.attach_option("graph", prefix, "Graph input. 
reads all graphs matching prefix*"); + clopts.attach_option("engine", exec_type, + "The asynchronous engine type (async or plasync)"); clopts.attach_option("format", format, "The graph format"); - clopts.attach_option("output", output, + clopts.attach_option("output", output, "A prefix to save the output."); - clopts.attach_option("powerlaw", powerlaw, + size_t powerlaw = 0; + clopts.attach_option("powerlaw", powerlaw, "Generate a synthetic powerlaw out-degree graph. "); clopts.attach_option("alpha", alpha, "Alpha in powerlaw distrubution"); @@ -219,10 +220,19 @@ int main(int argc, char** argv) { } + if (exec_type != "asynchronous" && exec_type != "async" + && exec_type != "powerlyra_asynchronous" && exec_type != "plasync"){ + dc.cout() << "Only supports asynchronous engine" << std::endl; + clopts.print_description(); + return EXIT_FAILURE; + } + graphlab::launch_metric_server(); - // load graph + + // Build the graph ---------------------------------------------------------- + dc.cout() << "Loading graph." << std::endl; + graphlab::timer timer; graph_type graph(dc, clopts); - if(powerlaw > 0) { // make a synthetic graph dc.cout() << "Loading synthetic Powerlaw graph." << std::endl; graph.load_synthetic_powerlaw(powerlaw, false, alpha, 100000000); @@ -237,12 +247,25 @@ int main(int argc, char** argv) { } graph.load_format(prefix, format); } + const double loading = timer.current_time(); + dc.cout() << "Loading graph. Finished in " + << loading << std::endl; + + // must call finalize before querying the graph + dc.cout() << "Finalizing graph." << std::endl; + timer.start(); graph.finalize(); + const double finalizing = timer.current_time(); + dc.cout() << "Finalizing graph. 
Finished in " + << finalizing << std::endl; + + // NOTE: ingress time = loading time + finalizing time + const double ingress = loading + finalizing; + dc.cout() << "Final Ingress (second): " << ingress << std::endl; - dc.cout() << "Number of vertices: " << graph.num_vertices() << std::endl - << "Number of edges: " << graph.num_edges() << std::endl; + dc.cout() << "#vertices: " << graph.num_vertices() + << " #edges:" << graph.num_edges() << std::endl; - graphlab::timer ti; // create engine to count the number of triangles dc.cout() << "Coloring..." << std::endl; @@ -251,16 +274,29 @@ int main(int argc, char** argv) { } else { clopts.get_engine_args().set_option("factorized", true); } - graphlab::async_consistent_engine engine(dc, graph, clopts); + + + // Running The Engine ------------------------------------------------------- + graphlab::omni_engine engine(dc, graph, exec_type, clopts); engine.signal_all(); + timer.start(); engine.start(); - - dc.cout() << "Colored in " << ti.current_time() << " seconds" << std::endl; + const double runtime = timer.current_time(); + dc.cout() << "----------------------------------------------------------" + << std::endl + << "Final Runtime (seconds): " << runtime + << std::endl + << "Updates executed: " << engine.num_updates() << std::endl + << "Update Rate (updates/second): " + << engine.num_updates() / runtime << std::endl; dc.cout() << "Colored using " << used_colors.size() << " colors" << std::endl; - + + size_t conflict_count = graph.map_reduce_edges(validate_conflict); dc.cout() << "Num conflicts = " << conflict_count << "\n"; + + // Save the final graph ----------------------------------------------------- if (output != "") { graph.save(output, save_colors(), diff --git a/toolkits/graph_analytics/sssp.cpp b/toolkits/graph_analytics/sssp.cpp index b6e8101b9e..bc8df15e4c 100644 --- a/toolkits/graph_analytics/sssp.cpp +++ b/toolkits/graph_analytics/sssp.cpp @@ -89,6 +89,17 @@ struct min_distance_type : graphlab::IS_POD_TYPE { 
} }; +struct max_distance_type : graphlab::IS_POD_TYPE { + distance_type dist; + max_distance_type(distance_type dist = + std::numeric_limits::min()) : dist(dist) { } + max_distance_type& operator+=(const max_distance_type& other) { + dist = std::max(dist, other.dist); + return *this; + } +}; + + /** * \brief The single source shortest path vertex program. @@ -181,7 +192,6 @@ struct shortest_path_writer { }; // end of shortest_path_writer - struct max_deg_vertex_reducer: public graphlab::IS_POD_TYPE { size_t degree; graphlab::vertex_id_type vid; @@ -200,6 +210,14 @@ max_deg_vertex_reducer find_max_deg_vertex(const graph_type::vertex_type vtx) { return red; } +max_distance_type map_dist(const graph_type::vertex_type& v) { + if (v.data().dist == std::numeric_limits::max()) + return std::numeric_limits::min(); + + max_distance_type dist(v.data().dist); + return dist; +} + int main(int argc, char** argv) { // Initialize control plain using mpi graphlab::mpi_tools::init(argc, argv); @@ -249,6 +267,8 @@ int main(int argc, char** argv) { // Build the graph ---------------------------------------------------------- + dc.cout() << "Loading graph." << std::endl; + graphlab::timer timer; graph_type graph(dc, clopts); if(powerlaw > 0) { // make a synthetic graph dc.cout() << "Loading synthetic Powerlaw graph." << std::endl; @@ -261,12 +281,26 @@ int main(int argc, char** argv) { clopts.print_description(); return EXIT_FAILURE; } + const double loading = timer.current_time(); + dc.cout() << "Loading graph. Finished in " + << loading << std::endl; + // must call finalize before querying the graph + dc.cout() << "Finalizing graph." << std::endl; + timer.start(); graph.finalize(); - dc.cout() << "#vertices: " << graph.num_vertices() << std::endl - << "#edges: " << graph.num_edges() << std::endl; + const double finalizing = timer.current_time(); + dc.cout() << "Finalizing graph. 
Finished in " + << finalizing << std::endl; + + // NOTE: ingress time = loading time + finalizing time + const double ingress = loading + finalizing; + dc.cout() << "Final Ingress (second): " << ingress << std::endl; + dc.cout() << "Final Ingress (second): " << ingress << std::endl; + dc.cout() << "#vertices: " << graph.num_vertices() + << " #edges:" << graph.num_edges() << std::endl; if(sources.empty()) { if (max_degree_source == false) { @@ -286,22 +320,28 @@ int main(int argc, char** argv) { } - // Running The Engine ------------------------------------------------------- graphlab::omni_engine engine(dc, graph, exec_type, clopts); - - // Signal all the vertices in the source set for(size_t i = 0; i < sources.size(); ++i) { engine.signal(sources[i], min_distance_type(0)); } + timer.start(); engine.start(); - const float runtime = engine.elapsed_seconds(); - dc.cout() << "Finished Running engine in " << runtime - << " seconds." << std::endl; - + const double runtime = timer.current_time(); + dc.cout() << "----------------------------------------------------------" + << std::endl + << "Final Runtime (seconds): " << runtime + << std::endl + << "Updates executed: " << engine.num_updates() << std::endl + << "Update Rate (updates/second): " + << engine.num_updates() / runtime << std::endl; + + const max_distance_type max_dist = + graph.map_reduce_vertices(map_dist); + std::cout << "Max distance: " << max_dist.dist << std::endl; // Save the final graph ----------------------------------------------------- if (saveprefix != "") {