Presented at the 24th IEEE International Symposium on High-Performance Computer Architecture (HPCA)