#### UNIVERSITY OF CALIFORNIA ### SANTA CRUZ . . . . A Heterogeneous CPU/GPU Implementation Of PageRank and Single Source Shortest Path On An Embedded Device By JJ Bowen August 2022 ### 1 Introduction The graph algorithms PageRank (PR) and Single Source Shortest Path (SSSP) are widely used in the world of modern computing. Running them on a Graphics Processing Unit (GPU) can increase their speed significantly, but certain segments of the graph are often better suited to CPU calculations. By splitting the graph into segments, one can run each segment on the processor and implementation that best suits its structure. However, this creates significant overhead, as the memory must be copied back and forth between the two processors, which slows the function down significantly. By running the SSSP algorithm on an embedded device, or a device where the CPU and GPU can access shared memory, one can increase the speed by a factor of 1.13x, however, this approach comes with its own challenges. The CPU and GPU can access the same memory, but not at the same time. This means that quite a bit of parallelization is lost in the process, which is a significant slowdown to the potential that could be achieved without this. This slowdown is particularly apparent on PR, which sees a speed decrease of over 3x. If a device existed where the CPU and GPU could both access the same memory at the same time, then I believe there would be a much more significant speed increase. ### 2 Different Device Schedules In the GitHub repository [1] I used, which was designed by Christopher Liu, Sanya Srivastava, Professor Tyler Sorensen and myself, there is an existing framework to get benchmarking results from running the two algorithms on different processors and approaches and to use these results to create a heterogeneous schedule. Depending on which device is used, however, these schedules can look radically different. The first device I used had a much larger GPU in comparison to the size of its CPU then the second one. The schedule of this device is shown below for running PR with 24 segments on a Kronecker graph [3] of scale 20. | <br> <br> Segment | Device 0<br> Intel 17-9700K<br> [1.68 ms] | Device 1 <br> NVIDIA Quadro RTX 4000 <br> [2.09 ms] | |--------------------|---------------------------------------------|---------------------------------------------------------| | 0<br> | | PR GPU block-red 256 1024 <br> [0.09 ms] | | 1 | | PR GPU block-red 1024 256 <br> [0.07 ms] | | 2 | | PR GPU block-red 1024 256 <br> [0.09 ms] | | 3 | | PR GPU block-red 4096 64 <br>[0.07 ms] | | 4 | | PR GPU block-red 4096 64 <br>[0.07 ms] | | 5 | | PR GPU block-red 4096 64 <br>[0.09 ms] | | 6 | | PR GPU warp-red 256 1024 <br>[0.08 ms] | | 7 | | PR GPU warp-red 256 1024 <br>[0.07 ms] | | 8 | | PR GPU warp-red 256 1024 <br>[0.07 ms] | | 9 | <u> </u> | PR GPU warp-red 256 1024 <br>[0.07 ms] | | 10 | | PR GPU warp-red 256 1024 <br>[0.10 ms] | | 11 | | PR GPU warp-red 256 1024 <br>[0.10 ms] | | 12 | | PR GPU warp-red 256 1024 <br>[0.10 ms] | | 13 | | PR GPU warp-red 256 1024 <br>[0.10 ms] | | 14 | | PR GPU warp-red 256 1024 <br>[0.10 ms] | | 15 | | PR GPU block-red 4096 64 <br> [0.11 ms] | | 16 | PR CPU one-to-one<br>[0.56 ms] | | | 17 | PR CPU one-to-one | | | 18 | PR CPU one-to-one<br>[0.56 ms] | | | 19 | <br> | PR GPU warp-red 256 1024 <br> [0.12 ms] | | 20 | | PR GPU warp-red 256 1024 <br> [0.12 ms] | | 21 | | PR GPU warp-red 256 1024 <br>[0.13 ms] | | 22 | | PR GPU warp-red 256 1024 <br>[0.17 ms] | | 23<br> <br> | <br> | PR GPU one-to-one 256 1024 <br> [0.17 ms] | The schedule for the second device, also running PR with 24 segments on the same graph is shown below. | +<br> | Device 0 | Device 1 | |---------------|---------------------------------|---------------------------------------------| | <br> Segment | Intel 17-9700K<br>[94.20 ms] | NVIDIA Quadro RTX 4000<br> [102.24 ms] | | 0 | | PR GPU block-red 2048 128 <br>[3.22 ms] | | 1 | | PR GPU warp-red 256 1024 <br>[2.74 ms] | | 2 | | PR GPU warp-red 256 1024 <br>[3.48 ms] | | 3 | | PR GPU warp-red 256 1024 <br>[3.41 ms] | | 4 | | PR GPU warp-red 256 1024 <br>[3.38 ms] | | 5<br> | | PR GPU warp-red 256 1024 <br> [4.17 ms] | | 6<br> | | PR GPU warp-red 256 1024 <br>[4.53 ms] | | 7 | | PR GPU warp-red 256 1024 <br>[4.48 ms] | | 8<br> | | PR GPU warp-red 256 1024 <br>[5.18 ms] | | 9 | | PR GPU warp-red 256 1024 <br>[4.63 ms] | | 10 | | PR GPU warp-red 256 1024 <br>[5.55 ms] | | 11 | | PR GPU warp-red 256 1024 <br>[5.88 ms] | | 12 | | PR GPU warp-red 256 1024 <br>[5.91 ms] | | 13 | | PR GPU warp-red 256 1024 <br>[5.86 ms] | | 14 | | PR GPU warp-red 256 1024 <br>[5.96 ms] | | 15 | | PR GPU warp-red 256 1024 <br>[6.81 ms] | | 16 | PR CPU one-to-one<br>[18.08 ms] | | | 17 | PR CPU one-to-one<br>[17.92 ms] | | | 18<br> | PR CPU one-to-one<br>[17.99 ms] | | | 19 | | PR GPU warp-red 256 1024 <br>[7.85 ms] | | 20 | PR CPU one-to-one<br>[19.88 ms] | | | 21 | PR CPU one-to-one<br>[20.33 ms] | | | 22 | | PR GPU warp-red 256 1024 <br>[9.75 ms] | | 23 | | PR GPU one-to-one 256 1024 <br> [9.45 ms] | | + | | ++ | Though these schedules are somewhat similar, they do differ a bit in certain places. The first schedule ends up using the CPU for 3 out of 24 segments, which are clustered together in one continuous block, while the second uses it for 5, which are almost continuous, with one small break. This outcome is to be expected, because the second machine has a much stronger CPU compared to its GPU, so it makes sense it would utilize that CPU for more segments. ## 3 Approaches The different CPU and GPU approaches I used are outlined in a previous paper [4], so I will only go over them briefly. There is one CPU approach, and three GPU ones, one to one, warp reduce, and block reduce. These four implementations each have strengths and weaknesses depending on the structure of the graph they are operating upon. The scheduler, previously discussed in section 2, can split the graph into segments, and find on which segments to use which approaches. With this information, it then generates a new heterogeneous function that runs these segments. In this heterogeneous version, the segments using the GPU approaches are run in sequence, each receiving the same input array, which they read from and write to. The CPU segments operate the same way and are run in parallel with the GPU ones. Once these are both complete, the function checks if any values were updated in any of the segments. If any have changed, the values from the GPU segments are copied into the CPU array and vice versa. This process is repeated until there are no updates. A breakdown of how the function utilizes its time using the command *nvprof* is shown below. ``` | Fig. | Files ``` This approach is somewhat problematic, because copying the memory is a slow process that can often take longer than the calculation phase. This is where one can see the advantages of an embedded device. Embedded devices have shared memory that can be accesses by the GPU and the CPU, which allows PR and SSSP to be run without any memory copying. This is done by first running the GPU segments, with the same input array as the previous approach, and then, once these are complete, passing the input array into the CPU segments. The *nvprof* results of this approach are shown below. From the results above, it is clear that Device to Host (DtoH) memory copying, the type that is being done at the end of each iteration, has been drastically reduced. The performance results of doing this vary between algorithms and are shown below in giga edges traversed per second (GTEPS). From the results, one can see that running SSSP without memory copying is about 1.13x faster than with, while PR is only .33x the speed. This is unsurprising, because the calculation phase in PR is much more complicated than that of SSSP. Because of this, the time increase of removing memory copying is much less significant than that lost by the removal of parallelization. ### 4 Device Limitations Looking at the previously described approach, one might wonder why the CPU and GPU segments are not run in parallel with one another. While this practice would most likely be significantly faster, it is unfortunately made impossible by the way the device defines a bus error. When the CPU and GPU are both given the same array, in shared memory, to read from and write to at the same time this triggers a bus error. Though unfortunate in this situation, it is understandable for the device to behave in such a way. What is much more problematic, however, is how the device handles the situation where the CPU and GPU are accessing different shared memory at the same time. In this situation, the actions of one processor should have no effect on the other, however, this is not the case. If the GPU kernel is running, the CPU cannot write to any array in shared memory. Worse still, it can read from an array in shared memory only if it is not writing to any memory, even its own. This is true even if the GPU is given an empty function, with no parameters and no code in the function body, which implies that when the GPU kernel launches, it locks up all memory it has access to. This is an incredibly suboptimal practice that, if removed, could lead to significant performance improvements. ### 5 Conclusion and Future Work The results show a small, but nonetheless significant, increase in the speed of the SSSP algorithm when run using the shared memory method over one that relies on memory copying. More importantly, they highlight a behavior in embedded devices that could be improved upon, although this is up to the devices designers to correct. If this flaw is eventually remedied, then future work on this subject could include an approach to both of these algorithms that utilizes both parallelization and shared memory, which could be significantly more effective than the implementations that are used today. # **Bibliography** - [1] JJ Bowen, Chris Liu, Tyler Sorensen & Sanya Srivastava "Hetero-Compute." GitHub, https://github.com/chrisliu/hetero-compute. - [2] Brin, S. & Page, L. (1998). The Anatomy of a Large-scale Hypertextual Web Search Engine. Comput. Netw. ISDN Syst., 30, 107–117. doi: 10.1016/S0169-7552(98)00110-X - [3] Leskovec, J., Chakrabarti, D., Kleinberg, J., Faloutsos, C. & Ghahramani, Z. (2010). Kronecker graphs: An approach to modeling networks. The Journal of Machine Learning Research, 11, 985–1042. - [4] JJ Bowen "A Heterogeneous CPU/GPU Implementation Of PageRank" 2022