HPCA

Cited by Paper title Year
1235 Amdahl’s Law in the multicore era. 2008
1022 Evaluating MapReduce for Multi-core and Multiprocessor Systems. 2007
770 LogTM: log-based transactional memory. 2006
617 System level analysis of fast, per-core DVFS using on-chip switching regulators. 2008
616 Unbounded Transactional Memory. 2005
589 Predicting Inter-Thread Cache Contention on a Chip Multi-Processor Architecture. 2005
499 Power Efficient Processor Architecture and The Cell Processor. 2005
395 The Soft Error Problem: An Architectural Perspective. 2005
386 Graphite: A distributed parallel simulator for multicores. 2010
373 LogTM-SE: Decoupling Hardware Transactional Memory from Caches. 2007
348 A novel architecture of the 3D stacked MRAM L2 cache for CMPs. 2009
336 Gaining insights into multicore cache partitioning: Bridging the gap between simulation and real systems. 2008
318 ATLAS: A scalable and high-performance scheduling algorithm for multiple memory controllers. 2010
315 Regional congestion awareness for load balance in networks-on-chip. 2008
266 Dynamic power-performance adaptation of parallel computation on chip multiprocessors. 2006
262 Relaxing non-volatility for fast and energy-efficient STT-RAM caches. 2011
243 Feedback Directed Prefetching: Improving the Performance and Bandwidth-Efficiency of Hardware Prefetchers. 2007
231 CMP network-on-chip overlaid with multi-band RF-interconnect. 2008
229 Improving read performance of Phase Change Memories via Write Cancellation and Write Pausing. 2010
229 A quantitative performance analysis model for GPU architectures. 2011
213 An optimized 3D-stacked memory architecture by exploiting excessive, high-density TSV bandwidth. 2010
211 Cluster-level feedback power control for performance optimization. 2008
206 BigDataBench: A big data benchmark suite from internet services. 2014
198 Chip Multithreading: Opportunities and Challenges. 2005
191 High performance network virtualization with SR-IOV. 2010
190 Concurrent Direct Network Access for Virtual Machine Monitors. 2007
183 Performance, Energy, and Thermal Considerations for SMT and CMP Architectures. 2005
183 Design and evaluation of a hierarchical on-chip interconnect for next-generation CMPs. 2009
177 SafeMem: Exploiting ECC-Memory for Detecting Memory Leaks and Memory Corruption During Production Runs. 2005
177 Dynamically Specialized Datapaths for energy efficient computing. 2011
176 BulletProof: a defect-tolerant CMP switch architecture. 2006
170 Express Cube Topologies for on-Chip Interconnects. 2009
169 CMP design space exploration subject to physical constraints. 2006
166 Essential roles of exploiting internal parallelism of flash memory based solid state drives in high-speed data processing. 2011
164 Construction and use of linear regression models for processor performance analysis. 2006
159 Thermal Herding: Microarchitecture Techniques for Controlling Hotspots in High-Performance 3D-Integrated Processors. 2007
154 Thread block compaction for efficient SIMT control flow. 2011
147 A Scalable, Non-blocking Approach to Transactional Memory. 2007
144 Application-Level Correctness and its Impact on Fault Tolerance. 2007
143 Last level cache (LLC) performance of data mining workloads on a CMP - a case study of parallel bioinformatics workloads. 2006
143 FlexiShare: Channel sharing for an energy-efficient nanophotonic crossbar. 2010
141 An Adaptive Shared/Private NUCA Cache Partitioning Scheme for Chip Multiprocessors. 2007
140 I-CASH: Intelligently Coupled Array of SSD and HDD. 2011
138 FlexiTaint: A programmable accelerator for dynamic taint propagation. 2008
137 HARD: Hardware-Assisted Lockset-based Race Detection. 2007
136 A New Scalable and Cost-Effective Congestion Management Strategy for Lossless Multistage Interconnection Networks. 2005
136 FREE-p: Protecting non-volatile memory against both hard and soft errors. 2011
132 A comprehensive approach to DRAM power management. 2008
131 Operating system support for overlapping-ISA heterogeneous multi-core architectures. 2010
131 CHIPPER: A low-complexity bufferless deflection router. 2011
130 Technology comparison for large last-level caches (L3Cs): Low-leakage SRAM, low write-energy STT-RAM, and refresh-optimized eDRAM. 2013
127 Retention-aware placement in DRAM (RAPID): software methods for quasi-non-volatile DRAM. 2006
125 The common case transactional behavior of multithreaded programs. 2006
125 A Burst Scheduling Access Reordering Mechanism. 2007
123 Adaptive Spill-Receive for robust high-performance caching in CMPs. 2009
120 Application performance modeling in a virtualized environment. 2010
119 Variation-aware dynamic voltage/frequency scaling. 2009
117 Transition Phase Classification and Prediction. 2005
117 Computational sprinting. 2012
115 Scalable architectural support for trusted software. 2010
113 Extending Multicore Architectures to Exploit Hybrid Parallelism in Single-thread Applications. 2007
112 Phase characterization for power: evaluating control-flow-based and event-counter-based techniques. 2006
112 A Hybrid solid-state storage architecture for the performance, energy consumption, and lifetime improvement. 2010
111 Characterizing and Comparing Prevailing Simulation Techniques. 2005
110 Cuckoo directory: A scalable directory for many-core systems. 2011
110 Beyond block I/O: Rethinking traditional storage primitives. 2011
109 Improving write operations in MLC phase change memory. 2012
107 C-Oracle: Predictive thermal management for data centers. 2008
107 Power struggles: Revisiting the RISC vs. CISC debate on contemporary ARM and x86 architectures. 2013
106 Elastic-buffer flow control for on-chip networks. 2009
105 Perturbation-based Fault Screening. 2007
104 Improving Multiple-CMP Systems Using Token Coherence. 2005
103 Uncovering hidden loop level parallelism in sequential applications. 2008
103 Designing a processor from the ground up to allow voltage/reliability tradeoffs. 2010
103 Tiered-latency DRAM: A low latency and low cost DRAM architecture. 2013
102 MemTracker: Efficient and Programmable Support for Memory Access Monitoring and Debugging. 2007
102 Balancing DRAM locality and parallelism in shared memory CMP systems. 2012
97 Illustrative Design Space Studies with Microarchitectural Regression Models. 2007
97 Compute Caches. 2017
95 Eliminating microarchitectural dependency from Architectural Vulnerability. 2009
95 Dynamic hardware-assisted software-controlled page placement to manage capacity allocation and sharing within large caches. 2009
94 Prediction router: Yet another low latency on-chip router architecture. 2009
92 The case for GPGPU spatial multitasking. 2012
91 A Performance Comparison of DRAM Memory System Optimizations for SMT Processors. 2005
91 CORD: cost-effective (and nearly overhead-free) order-recording and data race detection. 2006
91 Interval simulation: Raising the level of abstraction in architectural simulation. 2010
90 Checkpointed Early Load Retirement. 2005
90 CHOP: Adaptive filter-based DRAM caching for CMP server platforms. 2010
89 PageNUCA: Selected policies for page-grain locality management in large shared chip-multiprocessor caches. 2009
89 Accurate microarchitecture-level fault modeling for studying hardware faults. 2009
88 Techniques for bandwidth-efficient prefetching of linked data structures in hybrid prefetching systems. 2009
88 SCD: A scalable coherence directory with flexible sharer set encoding. 2012
88 TAP: A TLP-aware cache management policy for a CPU-GPU heterogeneous architecture. 2012
86 DMA-aware memory energy management. 2006
86 Exploiting parallelism and structure to accelerate the simulation of chip multi-processors. 2006
86 ReViveI/O: efficient handling of I/O in highly-available rollback-recovery servers. 2006
86 A low-radix and low-diameter 3D interconnection network design. 2009
86 CAMP: A technique to estimate per-structure power at run-time using a few simple parameters. 2009
84 A Unified Compressed Memory Hierarchy. 2005
84 Performance and power optimization through data compression in Network-on-Chip architectures. 2008
84 Blueshift: Designing processors for timing speculation from the ground up. 2009
83 Towards scalable, energy-efficient, bus-based on-chip networks. 2010
81 Trends in High-Performance Processors. 2005
81 Understanding how off-chip memory bandwidth partitioning in Chip Multiprocessors affects system performance. 2010
80 MISE: Providing performance predictability and improving fairness in shared main memory systems. 2013
80 Cache coherence for GPU architectures. 2013
79 Fully-Buffered DIMM Memory Architectures: Understanding Mechanisms, Overheads and Scaling. 2007
79 CPU-assisted GPGPU on fused CPU-GPU architectures. 2012
79 High-performance and energy-efficient mobile web browsing on big/little systems. 2013
78 Voltage and Frequency Control With Adaptive Reaction Time in Multiple-Clock-Domain Processors. 2005
77 Understanding the performance-temperature interactions in disk I/O of server workloads. 2006
77 High performance file I/O for the Blue Gene/L supercomputer. 2006
76 A first-order fine-grained multithreaded throughput model. 2009
76 SolarCore: Solar energy driven multi-core architecture power management. 2011
76 ESESC: A fast multicore simulator using Time-Based Sampling. 2013
75 Distributing the Frontend for Temperature Reduction. 2005
75 HAsim: FPGA-based high-detail multicore simulation using time-division multiplexing. 2011
74 DeCoR: A Delayed Commit and Rollback mechanism for handling inductive noise in processors. 2008
73 Bridging the computation gap between programmable processors and hardwired accelerators. 2009
73 Calvin: Deterministic or not? Free will to choose. 2011
73 MRPB: Memory request prioritization for massively parallel processors. 2014
72 Shared last-level TLBs for chip multiprocessors. 2011
72 Runnemede: An architecture for Ubiquitous High-Performance Computing. 2013
72 Accelerating write by exploiting PCM asymmetries. 2013
68 CloudCache: Expanding and shrinking private caches. 2011
68 Improving DRAM performance by parallelizing refreshes with accesses. 2014
67 SENSS: Security Enhancement to Symmetric Shared Memory Multiprocessors. 2005
66 Voltage emergency prediction: Using signatures to reduce operating margins. 2009
65 A Memory-Level Parallelism Aware Fetch Policy for SMT Processors. 2007
65 An OS-based alternative to full hardware coherence on tiled CMPs. 2008
65 Optimizing communication and capacity in a 3D stacked reconfigurable cache hierarchy. 2009
65 In-Network Snoop Ordering (INSO): Snoopy coherence on unordered interconnects. 2009
65 Warped register file: A power efficient register file for GPGPUs. 2013
64 On the Limits of Leakage Power Reduction in Caches. 2005
64 Interactions Between Compression and Prefetching in Chip Multiprocessors. 2007
64 Line Distillation: Increasing Cache Capacity by Filtering Unused Words in Cache Lines. 2007
64 Automated microprocessor stressmark generation. 2008
63 An Adaptive Cache Coherence Protocol Optimized for Producer-Consumer Sharing. 2007
63 Hardware-software integrated approaches to defend against software cache-based side channel attacks. 2009
62 Addressing system-level trimming issues in on-chip nanophotonic networks. 2011
62 A case for guarded power gating for multi-core processors. 2011
61 Stretching the Limits of Clock-Gating Efficiency in Server-Class Processors. 2005
61 A Small, Fast and Low-Power Register File by Bit-Partitioning. 2005
61 Mercury: A fast and energy-efficient multi-level cell based Phase Change Memory system. 2011
60 Practical and secure PCM systems by online detection of malicious write streams. 2011
60 Whole packet forwarding: Efficient design of fully adaptive routing algorithms for networks-on-chip. 2012
59 Effective Instruction Prefetching in Chip Multiprocessors for Modern Commercial Applications. 2005
59 Worth their watts? - an empirical study of datacenter servers. 2010
59 Reducing GPU offload latency via fine-grained CPU-GPU synchronization. 2013
58 Efficient scrub mechanisms for error-prone emerging memories. 2012
57 Modeling and Managing Thermal Profiles of Rack-mounted Servers with ThermoStat. 2007
56 MRR: Enabling fully adaptive multicast routing for CMP interconnection networks. 2009
56 Versatile prediction and fast estimation of Architectural Vulnerability Factor from processor performance metrics. 2009
56 Programming the cloud. 2011
56 Improving GPGPU resource utilization through alternative thread block scheduling. 2014
55 Application-to-core mapping policies to reduce memory system interference in multi-core systems. 2013
54 Enterprise IT Trends and Implications for Architecture Research. 2005
54 Archipelago: A polymorphic cache design for enabling robust near-threshold operation. 2011
54 Booster: Reactive core acceleration for mitigating the effects of process variation and application imbalance in low-voltage chips. 2012
53 Microarchitectural Wire Management for Performance and Power in Partitioned Architectures. 2005
53 An Efficient Programmable 10 Gigabit Ethernet Network Interface Card. 2005
53 Navigating heterogeneous processors with market mechanisms. 2013
52 iCFP: Tolerating all-level cache misses in in-order processors. 2009
52 Overcoming the challenges of crossbar resistive memory architectures. 2015
51 Architecture support for guest-transparent VM protection from untrusted hypervisor and physical attacks. 2013
50 A case for Refresh Pausing in DRAM memory systems. 2013
50 Breaking the on-chip latency barrier using SMART. 2013
50 Adaptive-latency DRAM: Optimizing DRAM timing for the common-case. 2015
49 InfoShield: a security architecture for protecting information usage in memory. 2006
49 Design and implementation of the Blue Gene/P snoop filter. 2008
49 Design and implementation of software-managed caches for multicores with local memory. 2009
49 Energy-efficient interconnect via Router Parking. 2013
49 Power-performance co-optimization of throughput core architecture using resistive memory. 2013
49 Architecture exploration for ambient energy harvesting nonvolatile processors. 2015
48 NUcache: An efficient multicore cache organization based on Next-Use distance. 2011
48 QuickIA: Exploring heterogeneous architectures on real prototypes. 2012
48 SNNAP: Approximate computing on programmable SoCs via neural acceleration. 2015
47 Scatter-Add in Data Parallel Architectures. 2005
47 Thread-safe dynamic binary translation using transactional memory. 2008
47 Dacota: Post-silicon validation of the memory subsystem in multi-core designs. 2009
47 Cooperative partitioning: Energy-efficient cache partitioning for high-performance CMPs. 2012
47 Optimizing virtual machine scheduling in NUMA multicore systems. 2013
47 i2WAP: Improving non-volatile cache lifetime by reducing inter- and intra-set write variations. 2013
46 Adaptive placement and migration policy for an STT-RAM-based hybrid cache. 2014
45 An approach for implementing efficient superscalar CISC processors. 2006
45 A new server I/O architecture for high speed networks. 2011
44 Atomic Coherence: Leveraging nanophotonics to build race-free cache coherence protocols. 2011
44 Fast thread migration via cache working set prediction. 2011
44 Dynamically heterogeneous cores through 3D resource pooling. 2012
44 Enabling distributed generation powered sustainable high-performance data center. 2013
44 A detailed GPU cache model based on reuse distance theory. 2014
43 Heat Stroke: Power-Density-Based Denial of Service in SMT. 2005
43 Coset coding to extend the lifetime of memory. 2013
43 NDA: Near-DRAM acceleration architecture leveraging commodity DRAM devices and standard memory modules. 2015
42 EXCES: External caching in energy saving storage systems. 2008
42 Simple virtual channel allocation for high throughput and high frequency on-chip routers. 2010
42 Design, integration and implementation of the DySER hardware accelerator into OpenSPARC. 2012
41 Colorama: Architectural Support for Data-Centric Synchronization. 2007
41 MorphCache: A Reconfigurable Adaptive Multi-level Cache hierarchy. 2011
41 Quasi-nonvolatile SSD: Trading flash memory nonvolatility to improve storage system performance for enterprise applications. 2012
41 Staged Reads: Mitigating the impact of DRAM writes on DRAM reads. 2012
41 EnergySmart: Toward energy-efficient manycores for Near-Threshold Computing. 2013
41 DASCA: Dead Write Prediction Assisted STT-RAM Cache Architecture. 2014
41 Suppressing the Oblivious RAM timing channel while making information leakage and program efficiency trade-offs. 2014
41 Data retention in MLC NAND flash memory: Characterization, optimization, and recovery. 2015
40 Accelerating and Adapting Precomputation Threads for Efficient Prefetching. 2007
40 Improving cache performance using read-write partitioning. 2014
40 MemZip: Exploring unconventional benefits from memory compression. 2014
40 Heterogeneous memory architectures: A HW/SW approach for mixing die-stacked and off-package memories. 2015
39 Liquid SIMD: Abstracting SIMD Hardware using Lightweight Dynamic Mapping. 2007
39 ESP-NUCA: A low-cost adaptive Non-Uniform Cache Architecture. 2010
39 Dynamic parallelization of JavaScript applications using an ultra-lightweight speculation mechanism. 2011
39 AgileRegulator: A hybrid voltage regulator scheme redeeming dark silicon for power efficiency in a multicore architecture. 2012
39 The dual-path execution model for efficient GPU control flow. 2013
38 A Domain-Specific On-Chip Network Design for Large Scale Cache Systems. 2007
38 Runtime validation of memory ordering using constraint graph checking. 2008
37 Efficient complex operators for irregular codes. 2011
37 System-level implications of disaggregated memory. 2012
37 Timing channel protection for a shared memory controller. 2014
37 Supporting x86-64 address translation for 100s of GPU lanes. 2014
36 Error Detection via Online Checking of Cache Coherence with Token Coherence Signatures. 2007
36 An intelligent IT infrastructure for the future. 2009
36 Optimizing Google’s warehouse scale computers: The NUMA experience. 2013
36 Sonic Millip3De: A massively parallel 3D-stacked accelerator for 3D ultrasound. 2013
35 A Low Overhead Fault Tolerant Coherence Protocol for CMP Architectures. 2007
35 GPGPU performance and power estimation using machine learning. 2015
34 Exploring the Design Space of Power-Aware Opto-Electronic Networked Systems. 2005
34 Supporting highly-decoupled thread-level redundancy for parallel programs. 2008
34 Reconciling specialization and flexibility through compound circuits. 2009
34 Fast complete memory consistency verification. 2009
34 Abstraction and microarchitecture scaling in early-stage power modeling. 2011
34 Disintegrated control for energy-efficient and heterogeneous memory systems. 2013
34 QuickRelease: A throughput-oriented approach to release consistency on GPUs. 2014
34 Event-based scheduling for energy-efficient QoS (eQoS) in mobile Web applications. 2015
33 A decoupled KILO-instruction processor. 2006
33 Characterization of Direct Cache Access on multi-core systems and 10GbE. 2009
33 Power-efficient computing for compute-intensive GPGPU applications. 2013
33 Increasing TLB reach by exploiting clustering in page translations. 2014
32 A bandwidth-aware memory-subsystem resource management using non-invasive resource profilers for large CMP systems. 2010
32 Bloom Filter Guided Transaction Scheduling. 2011
32 Refrint: Intelligent refresh to minimize power in on-chip multiprocessor cache hierarchies. 2013
31 Reducing resource redundancy for concurrent error detection techniques in high performance microprocessors. 2006
31 Fundamental performance constraints in horizontal fusion of in-order cores. 2008
30 UNified Instruction/Translation/Data (UNITD) coherence: One protocol to rule them all. 2010
30 Hardware/software techniques for DRAM thermal management. 2011
30 Achieving uniform performance and maximizing throughput in the presence of heterogeneity. 2011
30 Efficient data streaming with on-chip accelerators: Opportunities and challenges. 2011
30 Adrenaline: Pinpointing and reining in tail queries with quick voltage boosting. 2015
29 DMA cache: Using on-chip storage to architecturally separate I/O data from CPU data for improving I/O performance. 2010
29 SCRAP: Architecture for signature-based protection from Code Reuse Attacks. 2013
29 Mosaic: Exploiting the spatial locality of process variation to reduce refresh energy in on-chip eDRAM modules. 2014
29 Improving system throughput and fairness simultaneously in shared memory CMP systems via Dynamic Bank Partitioning. 2014
29 Sandbox Prefetching: Safe run-time evaluation of aggressive prefetchers. 2014
29 Warp-level divergence in GPUs: Characterization, impact, and mitigation. 2014
28 Multithreaded Value Prediction. 2005
28 Increasing the cache efficiency by eliminating noise. 2006
28 Practical off-chip meta-data for temporal memory streaming. 2009
28 Explaining cache SER anomaly using DUE AVF measurement. 2010
28 MORSE: Multi-objective reconfigurable self-optimizing memory scheduler. 2012
28 Modeling performance variation due to cache sharing. 2013
28 Scaling towards kilo-core processors with asymmetric high-radix topologies. 2013
28 Dynamic management of TurboMode in modern multi-core chips. 2014
28 TSO-CC: Consistency directed cache coherence for TSO. 2014
27 Software Directed Issue Queue Power Reduction. 2005
27 Efficient instruction schedulers for SMT processors. 2006
27 Single-level integrity and confidentiality protection for distributed shared memory multiprocessors. 2008
27 ACCESS: Smart scheduling for asymmetric cache CMPs. 2011
27 Low-voltage on-chip cache architecture using heterogeneous cell sizes for high-performance processors. 2011
26 Low-Overhead Interactive Debugging via Dynamic Instrumentation with DISE. 2005
26 Accurate Energy Dissipation and Thermal Modeling for Nanometer-Scale Buses. 2005
26 High-throughput pairwise point interactions in Anton, a specialized machine for molecular dynamics simulation. 2008
26 Power-Efficient DRAM Speculation. 2008
26 Power shifting in Thrifty Interconnection Network. 2011
26 JETC: Joint energy thermal and cooling management for memory and CPU subsystems in servers. 2012
26 Statistical performance comparisons of computers. 2012
26 Exploiting thermal energy storage to reduce data center capital and operating expenses. 2014
26 Understanding GPU errors on large-scale HPC systems and the implications for system design and operation. 2015
26 Exploiting compressed block size as an indicator of future reuse. 2015
26 Coordinated static and dynamic cache bypassing for GPUs. 2015
26 Bamboo ECC: Strong, safe, and flexible codes for reliable computer memory. 2015
25 Optical Interconnect Opportunities for Future Server Memory Systems. 2007
25 Data-triggered threads: Eliminating redundant computation. 2011
25 MP3: Minimizing performance penalty for power-gating of Clos network-on-chip. 2014
25 Mascar: Speeding up GPU warps by reducing memory pitstops. 2015
25 CATalyst: Defeating last-level cache side channel attacks in cloud computing. 2016
24 Exploiting Postdominance for Speculative Parallelization. 2007
24 Address-branch correlation: A novel locality for long-latency hard-to-predict branches. 2008
24 BOLT: Energy-efficient Out-of-Order Latency-Tolerant execution. 2010
24 π-TM: Pessimistic invalidation for scalable lazy hardware transactional memory. 2012
24 Layout-conscious random topologies for HPC off-chip interconnects. 2013
24 ECM: Effective Capacity Maximizer for high-performance compressed caching. 2013
24 NUAT: A non-uniform access time memory controller. 2014
24 Quantifying sources of error in McPAT and potential impacts on architectural studies. 2015
24 Power punch: Towards non-blocking power-gating of NoC routers. 2015
23 Completely verifying memory consistency of test program executions. 2006
23 Incorporating flexibility in Anton, a specialized machine for molecular dynamics simulation. 2008
23 Architectural Contesting. 2009
23 QORE: A fault tolerant network-on-chip architecture with power-efficient quad-function channel (QFC) buffers. 2014
23 ChargeCache: Reducing DRAM latency by exploiting row access locality. 2016
22 Value Based BTB Indexing for indirect jump prediction. 2010
22 Storage free confidence estimation for the TAGE branch predictor. 2011
22 Power balanced pipelines. 2012
22 Network congestion avoidance through Speculative Reservation. 2012
22 Mobile CPU’s rise to power: Quantifying the impact of generational mobile CPU design trends on performance, energy, and user satisfaction. 2016
21 Software-hardware cooperative memory disambiguation. 2006
21 Improving Branch Prediction and Predicated Execution in Out-of-Order Processors. 2007
21 Runahead Threads to improve SMT performance. 2008
21 Decoupled dynamic cache segmentation. 2012
21 Octopus-Man: QoS-driven task management for heterogeneous multicores in warehouse-scale computers. 2015
20 Store vectors for scalable memory dependence prediction and scheduling. 2006
20 Roughness of microarchitectural design topologies and its implications for optimization. 2008
20 Offline symbolic analysis to infer Total Store Order. 2011
20 MACAU: A Markov model for reliability evaluations of caches under Single-bit and Multi-bit Upsets. 2012
20 Cost effective data center servers. 2013
20 XChange: A market-based approach to scalable dynamic multi-resource allocation in multicore architectures. 2015
19 Feedback mechanisms for improving probabilistic memory prefetching. 2009
19 Soft error vulnerability aware process variation mitigation. 2009
19 IADVS: On-demand performance for interactive applications. 2010
19 HAQu: Hardware-accelerated queueing for fine-grained threading on a chip multiprocessor. 2011
19 Reducing the cost of persistence for nonvolatile heaps in end user devices. 2014
19 Concurrent and consistent virtual machine introspection with hardware transactional memory. 2014
19 CREAM: A Concurrent-Refresh-Aware DRAM Memory architecture. 2014
19 Stash directory: A scalable directory for many-core coherence. 2014
19 Priority-based cache allocation in throughput processors. 2015
19 Prediction-based superpage-friendly TLB designs. 2015
19 Unlocking bandwidth for GPUs in CC-NUMA systems. 2015
19 Low-Cost Inter-Linked Subarrays (LISA): Enabling fast inter-subarray data movement in DRAM. 2016
18 Prediction of CPU idle-busy activity pattern. 2008
18 Checked Load: Architectural support for JavaScript type-checking on mobile processors. 2011
18 WEST: Cloning data cache behavior using Stochastic Traces. 2012
18 Supporting efficient collective communication in NoCs. 2012
18 Pacman: Tolerating asymmetric data races with unintrusive hardware. 2012
18 Improving multi-core performance using mixed-cell cache architecture. 2013
18 Worm-Bubble Flow Control. 2013
18 Sprinkler: Maximizing resource utilization in many-chip solid state disks. 2014
18 PVCoherence: Designing flat coherence protocols for scalable verification. 2014
18 Supporting superpages in non-contiguous physical memory. 2015
18 BRAINIAC: Bringing reliable accuracy into neurally-implemented approximate computing. 2015
17 Probabilistic counter updates for predictor hysteresis and stratification. 2006
17 LiteTM: Reducing transactional state overhead. 2010
17 Locality-aware data replication in the Last-Level Cache. 2014
17 Spare register aware prefetching for graph algorithms on GPUs. 2014
17 Implications of high energy proportional servers on cluster-wide energy proportionality. 2014
17 Practical data value speculation for future high-end processors. 2014
17 Talus: A simple way to remove cliffs in cache performance. 2015
17 Hierarchical private/shared classification: The key to simple and efficient coherence for clustered cache hierarchies. 2015
16 Criticality-based optimizations for efficient load processing. 2009
16 SIF: Overcoming the limitations of SIMD devices via implicit permutation. 2010
16 StimulusCache: Boosting performance of chip multiprocessors with excess cache. 2010
16 Delay-Hiding energy management mechanisms for DRAM. 2010
16 Network within a network approach to create a scalable high-radix router microarchitecture. 2012
16 Tag tables. 2015
15 Parabix: Boosting the efficiency of text processing on commodity processors. 2012
15 Cache restoration for highly partitioned virtualized systems. 2012
15 Exploring high-performance and energy proportional interface for phase change memory systems. 2013
15 Tangle: Route-oriented dynamic voltage minimization for variation-afflicted, energy-efficient on-chip networks. 2014
15 A scalable multi-path microarchitecture for efficient GPU control flow. 2014
15 CAFO: Cost aware flip optimization for asymmetric memories. 2015
15 Malware-aware processors: A framework for efficient online malware detection. 2015
14 Implications of Device Timing Variability on Full Chip Timing. 2007
14 PEEP: Exploiting predictability of memory dependences in SMT processors. 2008
14 Adaptive Reliability Chipkill Correct (ARCC). 2013
14 Precision-aware soft error protection for GPUs. 2014
14 Revolver: Processor architecture for power efficient loop execution. 2014
14 Understanding contention-based channels and using them for defense. 2015
13 Exascale computing: The challenges and opportunities in the next decade. 2010
13 Accelerating business analytics applications. 2012
13 Undersubscribed threading on clustered cache architectures. 2014
13 Domain knowledge based energy management in handhelds. 2015
13 Paying to save: Reducing cost of colocation data center via rewards. 2015
13 Memristive Boltzmann machine: A hardware accelerator for combinatorial optimization and deep learning. 2016
12 Chip-multiprocessing and beyond. 2006
12 PaCo: Probability-based path confidence prediction. 2008
12 Adaptive Set-Granular Cooperative Caching. 2012
12 TS-Router: On maximizing the Quality-of-Allocation in the On-Chip Network. 2013
12 Dynamically detecting and tolerating IF-Condition Data Races. 2014
12 DraMon: Predicting memory bandwidth usage of multi-threaded programs with high accuracy and low overhead. 2014
12 Up by their bootstraps: Online learning in Artificial Neural Networks for CMP uncore power management. 2014
12 Scaling distributed cache hierarchies through computation and data co-scheduling. 2015
11 Tapping ZettaRAM™ for Low-Power Memory Systems. 2005
11 Using Virtual Load/Store Queues (VLSQs) to Reduce the Negative Effects of Reordered Memory Instructions. 2005
11 Skinflint DRAM system: Minimizing DRAM chip writes for low power. 2013
11 Macho: A failure model-oriented adaptive cache architecture to enable near-threshold voltage scaling. 2013
11 Accordion: Toward soft Near-Threshold Voltage Computing. 2014
11 3D stacking of high-performance processors. 2014
11 Augmenting low-latency HPC network with free-space optical links. 2015
11 TABLA: A unified template-based framework for accelerating statistical machine learning. 2016
10 Serializing instructions in system-intensive workloads: Amdahl’s Law strikes again. 2008
10 Speculative instruction validation for performance-reliability trade-off. 2008
10 COMIC++: A software SVM system for heterogeneous multicore accelerator clusters. 2010
10 BulkSMT: Designing SMT processors for atomic-block execution. 2012
10 Illusionist: Transforming lightweight cores into aggressive cores on demand. 2013
10 Store-Load-Branch (SLB) predictor: A compiler assisted branch prediction for data dependent branches. 2013
10 STM: Cloning the spatial and temporal memory access behavior. 2014
10 Strategies for anticipating risk in heterogeneous system design. 2014
10 Overcoming far-end congestion in large-scale networks. 2015
10 Revisiting virtual L1 caches: A practical design using dynamic synonym remapping. 2016
10 Energy-efficient address translation. 2016
9 Exploiting criticality to reduce bottlenecks in distributed uniprocessors. 2011
9 Rainbow: Efficient memory dependence recording with high replay parallelism for relaxed memory model. 2013
9 RECAP: A region-based cure for the common cold (cache). 2013
9 SCOC: High-radix switches made of bufferless clos networks. 2015
9 FTXen: Making hypervisor resilient to hardware faults on relaxed cores. 2015
9 Simultaneous Multikernel GPU: Multi-tasking throughput processors via fine-grained sharing. 2016
9 A performance analysis framework for optimizing OpenCL applications on FPGAs. 2016
9 HRL: Efficient and flexible reconfigurable logic for near-data processing. 2016
8 Performance-aware speculation control using wrong path usefulness prediction. 2008
8 Handling branches in TLS systems with Multi-Path Execution. 2010
8 Hardware/software-based diagnosis of load-store queues using expandable activity logs. 2011
8 Bridging the semantic gap: Emulating biological neuronal behaviors with simple digital neurons. 2013
8 A Non-Inclusive Memory Permissions architecture for protection against cross-layer attacks. 2014
8 Reducing read latency of phase change memory via early read and Turbo Read. 2015
8 Warped-preexecution: A GPU pre-execution approach for improving latency hiding. 2016
8 A case for toggle-aware compression for GPU systems. 2016
7 Architectural support for synchronization-free deterministic parallel programming. 2012
7 A novel system architecture for web scale applications using lightweight CPUs and virtualized I/O. 2013
7 A multiple SIMD, multiple data (MSMD) architecture: Parallel execution of dynamic and static SIMD fragments. 2013
7 Two level bulk preload branch prediction. 2013
7 High-speed formal verification of heterogeneous coherence hierarchies. 2013
7 Understanding the impact of gate-level physical reliability effects on whole program execution. 2014
7 Atomic SC for simple in-order processors. 2014
7 Transportation-network-inspired network-on-chip. 2014
7 FADE: A programmable filtering accelerator for instruction-grain monitoring. 2014
7 Exploring architectural heterogeneity in intelligent vision systems. 2015
7 GPU voltage noise: Characterization and hierarchical smoothing of spatial and temporal voltage noise interference in GPU architectures. 2015
7 BeBoP: A cost effective predictor infrastructure for superscalar value prediction. 2015
7 Understanding idle behavior and power gating mechanisms in the context of modern benchmarks on CPU-GPU Integrated systems. 2015
7 Cache QoS: From concept to reality in the Intel® Xeon® processor E5-2600 v3 product family. 2016
7 A large-scale study of soft-errors on GPUs in the field. 2016
7 Atomic persistence for SCM with a non-intrusive backend controller. 2016
6 High-Performance low-vcc in-order core. 2010
6 Flexible register management using reference counting. 2012
6 In-network traffic regulation for Transactional Memory. 2013
6 iPatch: Intelligent fault patching to improve energy efficiency. 2015
6 Flask coherence: A morphable hybrid coherence protocol to balance energy, performance and scalability. 2015
6 Balancing reliability, cost, and performance tradeoffs with FreeFault. 2015
6 Selective GPU caches to eliminate CPU-GPU HW cache coherence. 2016
5 Speculative synchronization and thread management for fine granularity threads. 2006
5 Fabric convergence implications on systems architecture. 2008
5 HARE: Hardware assisted reverse execution. 2010
5 DMA++: on the fly data realignment for on-chip memories. 2010
5 Fg-STP: Fine-Grain Single Thread Partitioning on Multicores. 2011
5 Architectural framework for supporting operating system survivability. 2011
5 A group-commit mechanism for ROB-based processors implementing the X86 ISA. 2013
5 Over-clocked SSD: Safely running beyond flash memory chip I/O clock specs. 2014
5 CDTT: Compiler-generated data-triggered threads. 2014
5 Scalably verifiable dynamic power management. 2014
5 GPUdmm: A high-performance and memory-oblivious GPU architecture using dynamic memory management. 2014
5 High performing cache hierarchies for server workloads: Relaxing inclusion to capture the latency benefits of exclusive caches. 2015
5 Increasing multicore system efficiency through intelligent bandwidth shifting. 2015
5 Understanding the virtualization “Tax” of scale-out pass-through GPUs in GaaS clouds: An empirical study. 2015
5 CiDRA: A cache-inspired DRAM resilience architecture. 2015
5 Scalable communication architecture for network-attached accelerators. 2015
5 VSR sort: A novel vectorised sorting algorithm & architecture extensions for future microprocessors. 2015
5 Efficient footprint caching for Tagless DRAM Caches. 2016
5 A complete key recovery timing attack on a GPU. 2016
5 McVerSi: A test generation framework for fast memory consistency verification in simulation. 2016
5 Pushing the limits of accelerator efficiency while retaining programmability. 2016
5 Lattice priority scheduling: Low-overhead timing-channel protection for a shared memory controller. 2016
5 Restore truncation for performance improvement in future DRAM systems. 2016
5 Modeling cache performance beyond LRU. 2016
5 SLaC: Stage laser control for a flattened butterfly network. 2016
4 Interconnect-Centric Computing. 2007
4 Branch-mispredict level parallelism (BLP) for control independence. 2008
4 LeadOut: Composing low-overhead frequency-enhancing techniques for single-thread performance in configurable multicores. 2010
4 BulkCompactor: Optimized deterministic execution via Conflict-Aware commit of atomic blocks. 2012
4 Architectural perspectives of future wireless base stations based on the IBM PowerEN™ processor. 2012
4 How to implement effective prediction and forwarding for fusable dynamic multicore architectures. 2013
4 Correction prediction: Reducing error correction latency for on-chip memories. 2015
4 CompEx: Compression-expansion coding for energy, latency, and lifetime improvements in MLC/TLC NVM. 2016
4 ScalCore: Designing a core for voltage scalability. 2016
4 Best-offset hardware prefetching. 2016
4 Predicting the memory bandwidth and optimal core allocations for multi-threaded applications on large-scale NUMA machines. 2016
4 Towards high performance paged memory for GPUs. 2016
4 SoftMC: A Flexible and Practical Open-Source Infrastructure for Enabling Experimental DRAM Studies. 2017
4 Vulnerabilities in MLC NAND Flash Memory Programming: Experimental Analysis, Exploits, and Mitigation Techniques. 2017
3 Petascale Computing Research Challenges - A Manycore Perspective. 2007
3 Lightweight predication support for out of order processors. 2009
3 MOPED: Orchestrating interprocess message data on CMPs. 2011
3 Safe and efficient supervised memory systems. 2011
3 Improving smartphone user experience by balancing performance and energy with probabilistic QoS guarantee. 2016
3 LASER: Light, Accurate Sharing dEtection and Repair. 2016
3 A low power software-defined-radio baseband processor for the Internet of Things. 2016
3 Parity Helix: Efficient protection for single-dimensional faults in multi-dimensional memory systems. 2016
3 Symbiotic job scheduling on the IBM POWER8. 2016
3 MaPU: A novel mathematical computing architecture. 2016
3 Transparent and Efficient CFI Enforcement with Intel Processor Trace. 2017
2 Industrial Perspectives: Platform Design Challenges with Many cores. 2006
2 Opportunities beyond single-core microprocessors. 2009
2 Accelerating decoupled look-ahead via weak dependence removal: A metaheuristic approach. 2014
2 Studying the impact of multicore processor scaling on directory techniques via reuse distance analysis. 2015
2 Alloy: Parallel-serial memory channel architecture for single-chip heterogeneous processor systems. 2015
2 Approximating warps with intra-warp operand value similarity. 2016
2 Software transparent dynamic binary translation for coarse-grain reconfigurable architectures. 2016
2 Amdahl’s law for lifetime reliability scaling in heterogeneous multicore processors. 2016
2 Cost effective physical register sharing. 2016
2 A low-power hybrid reconfigurable architecture for resistive random-access memories. 2016
2 LiveSim: Going live with microarchitecture simulation. 2016
2 Core tunneling: Variation-aware voltage noise mitigation in GPUs. 2016
2 Venice: Exploring server architectures for effective resource sharing. 2016
2 PipeLayer: A Pipelined ReRAM-Based Accelerator for Deep Learning. 2017
1 Architecting for power management: The IBM POWER7™ approach. 2010
1 Hybrid latency tolerance for robust energy-efficiency on 1000-core data parallel processors. 2013
1 Low-overhead and high coverage run-time race detection through selective meta-data management. 2014
1 DVFS for NoCs in CMPs: A thread voting approach. 2016
1 DUANG: Fast and lightweight page migration in asymmetric memory systems. 2016
1 PleaseTM: Enabling transaction conflict management in requester-wins hardware transactional memory. 2016
1 Minimal disturbance placement and promotion. 2016
1 iPAWS: Instruction-issue pattern-based adaptive warp scheduling for GPGPUs. 2016
1 Efficient synthetic traffic models for large, complex SoCs. 2016
1 Efficient GPU hardware transactional memory through early conflict resolution. 2016
1 The runahead network-on-chip. 2016
1 RADAR: Runtime-assisted dead region management for last-level caches. 2016
1 SizeCap: Efficiently handling power surges in fuel cell powered data centers. 2016
1 A market approach for handling power emergencies in multi-tenant data center. 2016
1 Cooper: Task Colocation with Cooperative Games. 2017
1 Secure Dynamic Memory Scheduling Against Timing Channel Attacks. 2017
1 Controlled Kernel Launch for Dynamic Parallelism in GPUs. 2017
1 Exploring Hyperdimensional Associative Memory. 2017
1 SILC-FM: Subblocked InterLeaved Cache-Like Flat Memory Organization. 2017
1 ATOM: Atomic Durability in Non-volatile Memory through Hardware Logging. 2017
1 MemPod: A Clustered Architecture for Efficient and Scalable Migration in Flat Address Space Multi-level Memories. 2017
1 Needle: Leveraging Program Analysis to Analyze and Extract Accelerators from Whole Programs. 2017
1 Dynamic GPGPU Power Management Using Adaptive Model Predictive Control. 2017
1 SWAP: Effective Fine-Grain Management of Shared Last-Level Caches with Minimum Hardware Support. 2017
1 GraphPIM: Enabling Instruction-Level PIM Offloading in Graph Computing Frameworks. 2017
1 Near-Ideal Networks-on-Chip for Servers. 2017
0 The Future of Computer Architecture Research: An Industrial Perspective. 2005
0 Industrial Perspectives: The Next Roadblocks in SOC Evolution: On-Chip Storage Capacity and Off-Chip Bandwidth. 2006
0 Industrial Perspectives: System IO Network Evolution - Closing Requirement Gaps. 2006
0 New architectures for a new biology. 2006
0 Intel’s Tera-scale Computing Project: The first five years, the next five years. 2008
0 Compilers and parallel computing systems. 2008
0 Industrial perspectives panel. 2009
0 Multi-core demands multi-interfaces. 2009
0 Is hardware innovation over? 2010
0 Extreme scale computing: Challenges and opportunities. 2010
0 How’s the parallel computing revolution going? 2011
0 Improving in-memory database index performance with Intel® Transactional Synchronization Extensions. 2014
0 Run-time monitoring with adjustable overhead using dataflow-guided filtering. 2015
0 Design and implementation of a mobile storage leveraging the DRAM interface. 2016
0 SCsafe: Logging sequential consistency violations continuously and precisely. 2016
0 PABST: Proportionally Allocated Bandwidth at the Source and Target. 2017
0 Near-Optimal Access Partitioning for Memory Hierarchies with Multiple Heterogeneous Bandwidth Sources. 2017
0 BRAVO: Balanced Reliability-Aware Voltage Optimization. 2017
0 Hipster: Hybrid Task Manager for Latency-Critical Cloud Workloads. 2017
0 Designing Low-Power, Low-Latency Networks-on-Chip by Optimally Combining Electrical and Optical Links. 2017
0 Design and Analysis of an APU for Exascale Computing. 2017
0 Boomerang: A Metadata-Free Architecture for Control Flow Delivery. 2017
0 Partial Row Activation for Low-Power DRAM System. 2017
0 High-Bandwidth Low-Latency Approximate Interconnection Networks. 2017
0 Efficient Sequential Consistency in GPUs via Relativistic Cache Coherence. 2017
0 Static Bubble: A Framework for Deadlock-Free Irregular On-chip Topologies. 2017
0 Cooperative Path-ORAM for Effective Memory Bandwidth Sharing in Server Settings. 2017
0 Camouflage: Memory Traffic Shaping to Mitigate Timing Attacks. 2017
0 Cold Boot Attacks are Still Hot: Security Analysis of Memory Scramblers in Modern Processors. 2017
0 Balancing Performance and Lifetime of MLC PCM by Using a Region Retention Monitor. 2017
0 Architecting an Energy-Efficient DRAM System for GPUs. 2017
0 Processing-in-Memory Enabled Graphics Processors for 3D Rendering. 2017
0 Design and Evaluation of AWGR-Based Photonic NoC Architectures for 2.5D Integrated High Performance Computing Systems. 2017
0 Defect Analysis and Cost-Effective Resilience Architecture for Future DRAM Devices. 2017
0 Random Folded Clos Topologies for Datacenter Networks. 2017
0 Radiation-Induced Error Criticality in Modern HPC Parallel Accelerators. 2017
0 Enabling Effective Module-Oblivious Power Gating for Embedded Processors. 2017
0 Fast Decentralized Power Capping for Server Clusters. 2017
0 Maximizing Cache Performance Under Uncertainty. 2017
0 Towards Pervasive and User Satisfactory CNN across GPU Microarchitectures. 2017
0 Supporting Address Translation for Accelerator-Centric Architectures. 2017
0 G-Scalar: Cost-Effective Generalized Scalar Execution Architecture for Power-Efficient GPUs. 2017
0 NCAP: Network-Driven, Packet Context-Aware Power Management for Client-Server Architecture. 2017
0 Fast and Accurate Exploration of Multi-level Caches Using Hierarchical Reuse Distance. 2017
0 Application-Specific Performance-Aware Energy Optimization on Android Mobile Devices. 2017
0 Pilot Register File: Energy Efficient Partitioned Register File for GPUs. 2017
0 FlexFlow: A Flexible Dataflow Accelerator Architecture for Convolutional Neural Networks. 2017
0 Reliability-Aware Scheduling on Heterogeneous Multicore Processors. 2017
0 KAML: A Flexible, High-Performance Key-Value SSD. 2017
0 A Split Cache Hierarchy for Enabling Data-Oriented Optimizations. 2017
0 Understanding and Optimizing Power Consumption in Memory Networks. 2017
0 SOUP-N-SALAD: Allocation-Oblivious Access Latency Reduction with Asymmetric DRAM Microarchitectures. 2017
0 Tiny Directory: Efficient Shared Memory in Many-Core Systems with Ultra-Low-Overhead Coherence Tracking. 2017

2017

Cited by Paper title
97 Compute Caches.
4 SoftMC: A Flexible and Practical Open-Source Infrastructure for Enabling Experimental DRAM Studies.
4 Vulnerabilities in MLC NAND Flash Memory Programming: Experimental Analysis, Exploits, and Mitigation Techniques.
3 Transparent and Efficient CFI Enforcement with Intel Processor Trace.
2 PipeLayer: A Pipelined ReRAM-Based Accelerator for Deep Learning.
1 Cooper: Task Colocation with Cooperative Games.
1 Secure Dynamic Memory Scheduling Against Timing Channel Attacks.
1 Controlled Kernel Launch for Dynamic Parallelism in GPUs.
1 Exploring Hyperdimensional Associative Memory.
1 SILC-FM: Subblocked InterLeaved Cache-Like Flat Memory Organization.
1 ATOM: Atomic Durability in Non-volatile Memory through Hardware Logging.
1 MemPod: A Clustered Architecture for Efficient and Scalable Migration in Flat Address Space Multi-level Memories.
1 Needle: Leveraging Program Analysis to Analyze and Extract Accelerators from Whole Programs.
1 Dynamic GPGPU Power Management Using Adaptive Model Predictive Control.
1 SWAP: Effective Fine-Grain Management of Shared Last-Level Caches with Minimum Hardware Support.
1 GraphPIM: Enabling Instruction-Level PIM Offloading in Graph Computing Frameworks.
1 Near-Ideal Networks-on-Chip for Servers.
0 PABST: Proportionally Allocated Bandwidth at the Source and Target.
0 Near-Optimal Access Partitioning for Memory Hierarchies with Multiple Heterogeneous Bandwidth Sources.
0 BRAVO: Balanced Reliability-Aware Voltage Optimization.
0 Hipster: Hybrid Task Manager for Latency-Critical Cloud Workloads.
0 Designing Low-Power, Low-Latency Networks-on-Chip by Optimally Combining Electrical and Optical Links.
0 Design and Analysis of an APU for Exascale Computing.
0 Boomerang: A Metadata-Free Architecture for Control Flow Delivery.
0 Partial Row Activation for Low-Power DRAM System.
0 High-Bandwidth Low-Latency Approximate Interconnection Networks.
0 Efficient Sequential Consistency in GPUs via Relativistic Cache Coherence.
0 Static Bubble: A Framework for Deadlock-Free Irregular On-chip Topologies.
0 Cooperative Path-ORAM for Effective Memory Bandwidth Sharing in Server Settings.
0 Camouflage: Memory Traffic Shaping to Mitigate Timing Attacks.
0 Cold Boot Attacks are Still Hot: Security Analysis of Memory Scramblers in Modern Processors.
0 Balancing Performance and Lifetime of MLC PCM by Using a Region Retention Monitor.
0 Architecting an Energy-Efficient DRAM System for GPUs.
0 Processing-in-Memory Enabled Graphics Processors for 3D Rendering.
0 Design and Evaluation of AWGR-Based Photonic NoC Architectures for 2.5D Integrated High Performance Computing Systems.
0 Defect Analysis and Cost-Effective Resilience Architecture for Future DRAM Devices.
0 Random Folded Clos Topologies for Datacenter Networks.
0 Radiation-Induced Error Criticality in Modern HPC Parallel Accelerators.
0 Enabling Effective Module-Oblivious Power Gating for Embedded Processors.
0 Fast Decentralized Power Capping for Server Clusters.
0 Maximizing Cache Performance Under Uncertainty.
0 Towards Pervasive and User Satisfactory CNN across GPU Microarchitectures.
0 Supporting Address Translation for Accelerator-Centric Architectures.
0 G-Scalar: Cost-Effective Generalized Scalar Execution Architecture for Power-Efficient GPUs.
0 NCAP: Network-Driven, Packet Context-Aware Power Management for Client-Server Architecture.
0 Fast and Accurate Exploration of Multi-level Caches Using Hierarchical Reuse Distance.
0 Application-Specific Performance-Aware Energy Optimization on Android Mobile Devices.
0 Pilot Register File: Energy Efficient Partitioned Register File for GPUs.
0 FlexFlow: A Flexible Dataflow Accelerator Architecture for Convolutional Neural Networks.
0 Reliability-Aware Scheduling on Heterogeneous Multicore Processors.
0 KAML: A Flexible, High-Performance Key-Value SSD.
0 A Split Cache Hierarchy for Enabling Data-Oriented Optimizations.
0 Understanding and Optimizing Power Consumption in Memory Networks.
0 SOUP-N-SALAD: Allocation-Oblivious Access Latency Reduction with Asymmetric DRAM Microarchitectures.
0 Tiny Directory: Efficient Shared Memory in Many-Core Systems with Ultra-Low-Overhead Coherence Tracking.

2016

Cited by Paper title
25 CATalyst: Defeating last-level cache side channel attacks in cloud computing.
23 ChargeCache: Reducing DRAM latency by exploiting row access locality.
22 Mobile CPU’s rise to power: Quantifying the impact of generational mobile CPU design trends on performance, energy, and user satisfaction.
19 Low-Cost Inter-Linked Subarrays (LISA): Enabling fast inter-subarray data movement in DRAM.
13 Memristive Boltzmann machine: A hardware accelerator for combinatorial optimization and deep learning.
11 TABLA: A unified template-based framework for accelerating statistical machine learning.
10 Revisiting virtual L1 caches: A practical design using dynamic synonym remapping.
10 Energy-efficient address translation.
9 Simultaneous Multikernel GPU: Multi-tasking throughput processors via fine-grained sharing.
9 A performance analysis framework for optimizing OpenCL applications on FPGAs.
9 HRL: Efficient and flexible reconfigurable logic for near-data processing.
8 Warped-preexecution: A GPU pre-execution approach for improving latency hiding.
8 A case for toggle-aware compression for GPU systems.
7 Cache QoS: From concept to reality in the Intel® Xeon® processor E5-2600 v3 product family.
7 A large-scale study of soft-errors on GPUs in the field.
7 Atomic persistence for SCM with a non-intrusive backend controller.
6 Selective GPU caches to eliminate CPU-GPU HW cache coherence.
5 Efficient footprint caching for Tagless DRAM Caches.
5 A complete key recovery timing attack on a GPU.
5 McVerSi: A test generation framework for fast memory consistency verification in simulation.
5 Pushing the limits of accelerator efficiency while retaining programmability.
5 Lattice priority scheduling: Low-overhead timing-channel protection for a shared memory controller.
5 Restore truncation for performance improvement in future DRAM systems.
5 Modeling cache performance beyond LRU.
5 SLaC: Stage laser control for a flattened butterfly network.
4 CompEx: Compression-expansion coding for energy, latency, and lifetime improvements in MLC/TLC NVM.
4 ScalCore: Designing a core for voltage scalability.
4 Best-offset hardware prefetching.
4 Predicting the memory bandwidth and optimal core allocations for multi-threaded applications on large-scale NUMA machines.
4 Towards high performance paged memory for GPUs.
3 Improving smartphone user experience by balancing performance and energy with probabilistic QoS guarantee.
3 LASER: Light, Accurate Sharing dEtection and Repair.
3 A low power software-defined-radio baseband processor for the Internet of Things.
3 Parity Helix: Efficient protection for single-dimensional faults in multi-dimensional memory systems.
3 Symbiotic job scheduling on the IBM POWER8.
3 MaPU: A novel mathematical computing architecture.
2 Approximating warps with intra-warp operand value similarity.
2 Software transparent dynamic binary translation for coarse-grain reconfigurable architectures.
2 Amdahl’s law for lifetime reliability scaling in heterogeneous multicore processors.
2 Cost effective physical register sharing.
2 A low-power hybrid reconfigurable architecture for resistive random-access memories.
2 LiveSim: Going live with microarchitecture simulation.
2 Core tunneling: Variation-aware voltage noise mitigation in GPUs.
2 Venice: Exploring server architectures for effective resource sharing.
1 DVFS for NoCs in CMPs: A thread voting approach.
1 DUANG: Fast and lightweight page migration in asymmetric memory systems.
1 PleaseTM: Enabling transaction conflict management in requester-wins hardware transactional memory.
1 Minimal disturbance placement and promotion.
1 iPAWS: Instruction-issue pattern-based adaptive warp scheduling for GPGPUs.
1 Efficient synthetic traffic models for large, complex SoCs.
1 Efficient GPU hardware transactional memory through early conflict resolution.
1 The runahead network-on-chip.
1 RADAR: Runtime-assisted dead region management for last-level caches.
1 SizeCap: Efficiently handling power surges in fuel cell powered data centers.
1 A market approach for handling power emergencies in multi-tenant data center.
0 Design and implementation of a mobile storage leveraging the DRAM interface.
0 SCsafe: Logging sequential consistency violations continuously and precisely.

2015

Cited by Paper title
52 Overcoming the challenges of crossbar resistive memory architectures.
50 Adaptive-latency DRAM: Optimizing DRAM timing for the common-case.
49 Architecture exploration for ambient energy harvesting nonvolatile processors.
48 SNNAP: Approximate computing on programmable SoCs via neural acceleration.
43 NDA: Near-DRAM acceleration architecture leveraging commodity DRAM devices and standard memory modules.
41 Data retention in MLC NAND flash memory: Characterization, optimization, and recovery.
40 Heterogeneous memory architectures: A HW/SW approach for mixing die-stacked and off-package memories.
35 GPGPU performance and power estimation using machine learning.
34 Event-based scheduling for energy-efficient QoS (eQoS) in mobile Web applications.
30 Adrenaline: Pinpointing and reining in tail queries with quick voltage boosting.
26 Understanding GPU errors on large-scale HPC systems and the implications for system design and operation.
26 Exploiting compressed block size as an indicator of future reuse.
26 Coordinated static and dynamic cache bypassing for GPUs.
26 Bamboo ECC: Strong, safe, and flexible codes for reliable computer memory.
25 Mascar: Speeding up GPU warps by reducing memory pitstops.
24 Quantifying sources of error in McPAT and potential impacts on architectural studies.
24 Power punch: Towards non-blocking power-gating of NoC routers.
21 Octopus-Man: QoS-driven task management for heterogeneous multicores in warehouse-scale computers.
20 XChange: A market-based approach to scalable dynamic multi-resource allocation in multicore architectures.
19 Priority-based cache allocation in throughput processors.
19 Prediction-based superpage-friendly TLB designs.
19 Unlocking bandwidth for GPUs in CC-NUMA systems.
18 Supporting superpages in non-contiguous physical memory.
18 BRAINIAC: Bringing reliable accuracy into neurally-implemented approximate computing.
17 Talus: A simple way to remove cliffs in cache performance.
17 Hierarchical private/shared classification: The key to simple and efficient coherence for clustered cache hierarchies.
16 Tag tables.
15 CAFO: Cost aware flip optimization for asymmetric memories.
15 Malware-aware processors: A framework for efficient online malware detection.
14 Understanding contention-based channels and using them for defense.
13 Domain knowledge based energy management in handhelds.
13 Paying to save: Reducing cost of colocation data center via rewards.
12 Scaling distributed cache hierarchies through computation and data co-scheduling.
11 Augmenting low-latency HPC network with free-space optical links.
10 Overcoming far-end congestion in large-scale networks.
9 SCOC: High-radix switches made of bufferless clos networks.
9 FTXen: Making hypervisor resilient to hardware faults on relaxed cores.
8 Reducing read latency of phase change memory via early read and Turbo Read.
7 Exploring architectural heterogeneity in intelligent vision systems.
7 GPU voltage noise: Characterization and hierarchical smoothing of spatial and temporal voltage noise interference in GPU architectures.
7 BeBoP: A cost effective predictor infrastructure for superscalar value prediction.
7 Understanding idle behavior and power gating mechanisms in the context of modern benchmarks on CPU-GPU Integrated systems.
6 iPatch: Intelligent fault patching to improve energy efficiency.
6 Flask coherence: A morphable hybrid coherence protocol to balance energy, performance and scalability.
6 Balancing reliability, cost, and performance tradeoffs with FreeFault.
5 High performing cache hierarchies for server workloads: Relaxing inclusion to capture the latency benefits of exclusive caches.
5 Increasing multicore system efficiency through intelligent bandwidth shifting.
5 Understanding the virtualization “Tax” of scale-out pass-through GPUs in GaaS clouds: An empirical study.
5 CiDRA: A cache-inspired DRAM resilience architecture.
5 Scalable communication architecture for network-attached accelerators.
5 VSR sort: A novel vectorised sorting algorithm & architecture extensions for future microprocessors.
4 Correction prediction: Reducing error correction latency for on-chip memories.
2 Studying the impact of multicore processor scaling on directory techniques via reuse distance analysis.
2 Alloy: Parallel-serial memory channel architecture for single-chip heterogeneous processor systems.
0 Run-time monitoring with adjustable overhead using dataflow-guided filtering.

2014

Cited by Paper title
206 BigDataBench: A big data benchmark suite from internet services.
73 MRPB: Memory request prioritization for massively parallel processors.
68 Improving DRAM performance by parallelizing refreshes with accesses.
56 Improving GPGPU resource utilization through alternative thread block scheduling.
46 Adaptive placement and migration policy for an STT-RAM-based hybrid cache.
44 A detailed GPU cache model based on reuse distance theory.
41 DASCA: Dead Write Prediction Assisted STT-RAM Cache Architecture.
41 Suppressing the Oblivious RAM timing channel while making information leakage and program efficiency trade-offs.
40 Improving cache performance using read-write partitioning.
40 MemZip: Exploring unconventional benefits from memory compression.
37 Timing channel protection for a shared memory controller.
37 Supporting x86-64 address translation for 100s of GPU lanes.
34 QuickRelease: A throughput-oriented approach to release consistency on GPUs.
33 Increasing TLB reach by exploiting clustering in page translations.
29 Mosaic: Exploiting the spatial locality of process variation to reduce refresh energy in on-chip eDRAM modules.
29 Improving system throughput and fairness simultaneously in shared memory CMP systems via Dynamic Bank Partitioning.
29 Sandbox Prefetching: Safe run-time evaluation of aggressive prefetchers.
29 Warp-level divergence in GPUs: Characterization, impact, and mitigation.
28 Dynamic management of TurboMode in modern multi-core chips.
28 TSO-CC: Consistency directed cache coherence for TSO.
26 Exploiting thermal energy storage to reduce data center capital and operating expenses.
25 MP3: Minimizing performance penalty for power-gating of Clos network-on-chip.
24 NUAT: A non-uniform access time memory controller.
23 QORE: A fault tolerant network-on-chip architecture with power-efficient quad-function channel (QFC) buffers.
19 Reducing the cost of persistence for nonvolatile heaps in end user devices.
19 Concurrent and consistent virtual machine introspection with hardware transactional memory.
19 CREAM: A Concurrent-Refresh-Aware DRAM Memory architecture.
19 Stash directory: A scalable directory for many-core coherence.
18 Sprinkler: Maximizing resource utilization in many-chip solid state disks.
18 PVCoherence: Designing flat coherence protocols for scalable verification.
17 Locality-aware data replication in the Last-Level Cache.
17 Spare register aware prefetching for graph algorithms on GPUs.
17 Implications of high energy proportional servers on cluster-wide energy proportionality.
17 Practical data value speculation for future high-end processors.
15 Tangle: Route-oriented dynamic voltage minimization for variation-afflicted, energy-efficient on-chip networks.
15 A scalable multi-path microarchitecture for efficient GPU control flow.
14 Precision-aware soft error protection for GPUs.
14 Revolver: Processor architecture for power efficient loop execution.
13 Undersubscribed threading on clustered cache architectures.
12 Dynamically detecting and tolerating IF-Condition Data Races.
12 DraMon: Predicting memory bandwidth usage of multi-threaded programs with high accuracy and low overhead.
12 Up by their bootstraps: Online learning in Artificial Neural Networks for CMP uncore power management.
11 Accordion: Toward soft Near-Threshold Voltage Computing.
11 3D stacking of high-performance processors.
10 STM: Cloning the spatial and temporal memory access behavior.
10 Strategies for anticipating risk in heterogeneous system design.
8 A Non-Inclusive Memory Permissions architecture for protection against cross-layer attacks.
7 Understanding the impact of gate-level physical reliability effects on whole program execution.
7 Atomic SC for simple in-order processors.
7 Transportation-network-inspired network-on-chip.
7 FADE: A programmable filtering accelerator for instruction-grain monitoring.
5 Over-clocked SSD: Safely running beyond flash memory chip I/O clock specs.
5 CDTT: Compiler-generated data-triggered threads.
5 Scalably verifiable dynamic power management.
5 GPUdmm: A high-performance and memory-oblivious GPU architecture using dynamic memory management.
2 Accelerating decoupled look-ahead via weak dependence removal: A metaheuristic approach.
1 Low-overhead and high coverage run-time race detection through selective meta-data management.
0 Improving in-memory database index performance with Intel® Transactional Synchronization Extensions.

2013

Cited by Paper title
130 Technology comparison for large last-level caches (L3Cs): Low-leakage SRAM, low write-energy STT-RAM, and refresh-optimized eDRAM.
107 Power struggles: Revisiting the RISC vs. CISC debate on contemporary ARM and x86 architectures.
103 Tiered-latency DRAM: A low latency and low cost DRAM architecture.
80 MISE: Providing performance predictability and improving fairness in shared main memory systems.
80 Cache coherence for GPU architectures.
79 High-performance and energy-efficient mobile web browsing on big/little systems.
76 ESESC: A fast multicore simulator using Time-Based Sampling.
72 Runnemede: An architecture for Ubiquitous High-Performance Computing.
72 Accelerating write by exploiting PCM asymmetries.
65 Warped register file: A power efficient register file for GPGPUs.
59 Reducing GPU offload latency via fine-grained CPU-GPU synchronization.
55 Application-to-core mapping policies to reduce memory system interference in multi-core systems.
53 Navigating heterogeneous processors with market mechanisms.
51 Architecture support for guest-transparent VM protection from untrusted hypervisor and physical attacks.
50 A case for Refresh Pausing in DRAM memory systems.
50 Breaking the on-chip latency barrier using SMART.
49 Energy-efficient interconnect via Router Parking.
49 Power-performance co-optimization of throughput core architecture using resistive memory.
47 Optimizing virtual machine scheduling in NUMA multicore systems.
47 i2WAP: Improving non-volatile cache lifetime by reducing inter- and intra-set write variations.
44 Enabling distributed generation powered sustainable high-performance data center.
43 Coset coding to extend the lifetime of memory.
41 EnergySmart: Toward energy-efficient manycores for Near-Threshold Computing.
39 The dual-path execution model for efficient GPU control flow.
36 Optimizing Google’s warehouse scale computers: The NUMA experience.
36 Sonic Millip3De: A massively parallel 3D-stacked accelerator for 3D ultrasound.
34 Disintegrated control for energy-efficient and heterogeneous memory systems.
33 Power-efficient computing for compute-intensive GPGPU applications.
32 Refrint: Intelligent refresh to minimize power in on-chip multiprocessor cache hierarchies.
29 SCRAP: Architecture for signature-based protection from Code Reuse Attacks.
28 Modeling performance variation due to cache sharing.
28 Scaling towards kilo-core processors with asymmetric high-radix topologies.
24 Layout-conscious random topologies for HPC off-chip interconnects.
24 ECM: Effective Capacity Maximizer for high-performance compressed caching.
20 Cost effective data center servers.
18 Improving multi-core performance using mixed-cell cache architecture.
18 Worm-Bubble Flow Control.
15 Exploring high-performance and energy proportional interface for phase change memory systems.
14 Adaptive Reliability Chipkill Correct (ARCC).
12 TS-Router: On maximizing the Quality-of-Allocation in the On-Chip Network.
11 Skinflint DRAM system: Minimizing DRAM chip writes for low power.
11 Macho: A failure model-oriented adaptive cache architecture to enable near-threshold voltage scaling.
10 Illusionist: Transforming lightweight cores into aggressive cores on demand.
10 Store-Load-Branch (SLB) predictor: A compiler assisted branch prediction for data dependent branches.
9 Rainbow: Efficient memory dependence recording with high replay parallelism for relaxed memory model.
9 RECAP: A region-based cure for the common cold (cache).
8 Bridging the semantic gap: Emulating biological neuronal behaviors with simple digital neurons.
7 A novel system architecture for web scale applications using lightweight CPUs and virtualized I/O.
7 A multiple SIMD, multiple data (MSMD) architecture: Parallel execution of dynamic and static SIMD fragments.
7 Two level bulk preload branch prediction.
7 High-speed formal verification of heterogeneous coherence hierarchies.
6 In-network traffic regulation for Transactional Memory.
5 A group-commit mechanism for ROB-based processors implementing the X86 ISA.
4 How to implement effective prediction and forwarding for fusable dynamic multicore architectures.
1 Hybrid latency tolerance for robust energy-efficiency on 1000-core data parallel processors.

2012

Cited by Paper title
117 Computational sprinting.
109 Improving write operations in MLC phase change memory.
102 Balancing DRAM locality and parallelism in shared memory CMP systems.
92 The case for GPGPU spatial multitasking.
88 SCD: A scalable coherence directory with flexible sharer set encoding.
88 TAP: A TLP-aware cache management policy for a CPU-GPU heterogeneous architecture.
79 CPU-assisted GPGPU on fused CPU-GPU architectures.
60 Whole packet forwarding: Efficient design of fully adaptive routing algorithms for networks-on-chip.
58 Efficient scrub mechanisms for error-prone emerging memories.
54 Booster: Reactive core acceleration for mitigating the effects of process variation and application imbalance in low-voltage chips.
48 QuickIA: Exploring heterogeneous architectures on real prototypes.
47 Cooperative partitioning: Energy-efficient cache partitioning for high-performance CMPs.
44 Dynamically heterogeneous cores through 3D resource pooling.
42 Design, integration and implementation of the DySER hardware accelerator into OpenSPARC.
41 Quasi-nonvolatile SSD: Trading flash memory nonvolatility to improve storage system performance for enterprise applications.
41 Staged Reads: Mitigating the impact of DRAM writes on DRAM reads.
39 AgileRegulator: A hybrid voltage regulator scheme redeeming dark silicon for power efficiency in a multicore architecture.
37 System-level implications of disaggregated memory.
28 MORSE: Multi-objective reconfigurable self-optimizing memory scheduler.
26 JETC: Joint energy thermal and cooling management for memory and CPU subsystems in servers.
26 Statistical performance comparisons of computers.
24 π-TM: Pessimistic invalidation for scalable lazy hardware transactional memory.
22 Power balanced pipelines.
22 Network congestion avoidance through Speculative Reservation.
21 Decoupled dynamic cache segmentation.
20 MACAU: A Markov model for reliability evaluations of caches under Single-bit and Multi-bit Upsets.
18 WEST: Cloning data cache behavior using Stochastic Traces.
18 Supporting efficient collective communication in NoCs.
18 Pacman: Tolerating asymmetric data races with unintrusive hardware.
16 Network within a network approach to create a scalable high-radix router microarchitecture.
15 Parabix: Boosting the efficiency of text processing on commodity processors.
15 Cache restoration for highly partitioned virtualized systems.
13 Accelerating business analytics applications.
12 Adaptive Set-Granular Cooperative Caching.
10 BulkSMT: Designing SMT processors for atomic-block execution.
7 Architectural support for synchronization-free deterministic parallel programming.
6 Flexible register management using reference counting.
4 BulkCompactor: Optimized deterministic execution via Conflict-Aware commit of atomic blocks.
4 Architectural perspectives of future wireless base stations based on the IBM PowerEN™ processor.

2011

Cited by Paper title
262 Relaxing non-volatility for fast and energy-efficient STT-RAM caches.
229 A quantitative performance analysis model for GPU architectures.
177 Dynamically Specialized Datapaths for energy efficient computing.
166 Essential roles of exploiting internal parallelism of flash memory based solid state drives in high-speed data processing.
154 Thread block compaction for efficient SIMT control flow.
140 I-CASH: Intelligently Coupled Array of SSD and HDD.
136 FREE-p: Protecting non-volatile memory against both hard and soft errors.
131 CHIPPER: A low-complexity bufferless deflection router.
110 Cuckoo directory: A scalable directory for many-core systems.
110 Beyond block I/O: Rethinking traditional storage primitives.
76 SolarCore: Solar energy driven multi-core architecture power management.
75 HAsim: FPGA-based high-detail multicore simulation using time-division multiplexing.
73 Calvin: Deterministic or not? Free will to choose.
72 Shared last-level TLBs for chip multiprocessors.
68 CloudCache: Expanding and shrinking private caches.
62 Addressing system-level trimming issues in on-chip nanophotonic networks.
62 A case for guarded power gating for multi-core processors.
61 Mercury: A fast and energy-efficient multi-level cell based Phase Change Memory system.
60 Practical and secure PCM systems by online detection of malicious write streams.
56 Programming the cloud.
54 Archipelago: A polymorphic cache design for enabling robust near-threshold operation.
48 NUcache: An efficient multicore cache organization based on Next-Use distance.
45 A new server I/O architecture for high speed networks.
44 Atomic Coherence: Leveraging nanophotonics to build race-free cache coherence protocols.
44 Fast thread migration via cache working set prediction.
41 MorphCache: A Reconfigurable Adaptive Multi-level Cache hierarchy.
39 Dynamic parallelization of JavaScript applications using an ultra-lightweight speculation mechanism.
37 Efficient complex operators for irregular codes.
34 Abstraction and microarchitecture scaling in early-stage power modeling.
32 Bloom Filter Guided Transaction Scheduling.
30 Hardware/software techniques for DRAM thermal management.
30 Achieving uniform performance and maximizing throughput in the presence of heterogeneity.
30 Efficient data streaming with on-chip accelerators: Opportunities and challenges.
27 ACCESS: Smart scheduling for asymmetric cache CMPs.
27 Low-voltage on-chip cache architecture using heterogeneous cell sizes for high-performance processors.
26 Power shifting in Thrifty Interconnection Network.
25 Data-triggered threads: Eliminating redundant computation.
22 Storage free confidence estimation for the TAGE branch predictor.
20 Offline symbolic analysis to infer Total Store Order.
19 HAQu: Hardware-accelerated queueing for fine-grained threading on a chip multiprocessor.
18 Checked Load: Architectural support for JavaScript type-checking on mobile processors.
9 Exploiting criticality to reduce bottlenecks in distributed uniprocessors.
8 Hardware/software-based diagnosis of load-store queues using expandable activity logs.
5 Fg-STP: Fine-Grain Single Thread Partitioning on Multicores.
5 Architectural framework for supporting operating system survivability.
3 MOPED: Orchestrating interprocess message data on CMPs.
3 Safe and efficient supervised memory systems.
0 How’s the parallel computing revolution going?

2010

Cited by Paper title
386 Graphite: A distributed parallel simulator for multicores.
318 ATLAS: A scalable and high-performance scheduling algorithm for multiple memory controllers.
229 Improving read performance of Phase Change Memories via Write Cancellation and Write Pausing.
213 An optimized 3D-stacked memory architecture by exploiting excessive, high-density TSV bandwidth.
191 High performance network virtualization with SR-IOV.
143 FlexiShare: Channel sharing for an energy-efficient nanophotonic crossbar.
131 Operating system support for overlapping-ISA heterogeneous multi-core architectures.
120 Application performance modeling in a virtualized environment.
115 Scalable architectural support for trusted software.
112 A Hybrid solid-state storage architecture for the performance, energy consumption, and lifetime improvement.
103 Designing a processor from the ground up to allow voltage/reliability tradeoffs.
91 Interval simulation: Raising the level of abstraction in architectural simulation.
90 CHOP: Adaptive filter-based DRAM caching for CMP server platforms.
83 Towards scalable, energy-efficient, bus-based on-chip networks.
81 Understanding how off-chip memory bandwidth partitioning in Chip Multiprocessors affects system performance.
59 Worth their watts? - an empirical study of datacenter servers.
42 Simple virtual channel allocation for high throughput and high frequency on-chip routers.
39 ESP-NUCA: A low-cost adaptive Non-Uniform Cache Architecture.
32 A bandwidth-aware memory-subsystem resource management using non-invasive resource profilers for large CMP systems.
30 UNified Instruction/Translation/Data (UNITD) coherence: One protocol to rule them all.
29 DMA cache: Using on-chip storage to architecturally separate I/O data from CPU data for improving I/O performance.
28 Explaining cache SER anomaly using DUE AVF measurement.
24 BOLT: Energy-efficient Out-of-Order Latency-Tolerant execution.
22 Value Based BTB Indexing for indirect jump prediction.
19 IADVS: On-demand performance for interactive applications.
17 LiteTM: Reducing transactional state overhead.
16 SIF: Overcoming the limitations of SIMD devices via implicit permutation.
16 StimulusCache: Boosting performance of chip multiprocessors with excess cache.
16 Delay-Hiding energy management mechanisms for DRAM.
13 Exascale computing: The challenges and opportunities in the next decade.
10 COMIC++: A software SVM system for heterogeneous multicore accelerator clusters.
8 Handling branches in TLS systems with Multi-Path Execution.
6 High-Performance low-vcc in-order core.
5 HARE: Hardware assisted reverse execution.
5 DMA++: on the fly data realignment for on-chip memories.
4 LeadOut: Composing low-overhead frequency-enhancing techniques for single-thread performance in configurable multicores.
1 Architecting for power management: The IBM POWER7™ approach.
0 Is hardware innovation over?
0 Extreme scale computing: Challenges and opportunities.

2009

Cited by Paper title
348 A novel architecture of the 3D stacked MRAM L2 cache for CMPs.
183 Design and evaluation of a hierarchical on-chip interconnect for next-generation CMPs.
170 Express Cube Topologies for on-Chip Interconnects.
123 Adaptive Spill-Receive for robust high-performance caching in CMPs.
119 Variation-aware dynamic voltage/frequency scaling.
106 Elastic-buffer flow control for on-chip networks.
95 Eliminating microarchitectural dependency from Architectural Vulnerability.
95 Dynamic hardware-assisted software-controlled page placement to manage capacity allocation and sharing within large caches.
94 Prediction router: Yet another low latency on-chip router architecture.
89 PageNUCA: Selected policies for page-grain locality management in large shared chip-multiprocessor caches.
89 Accurate microarchitecture-level fault modeling for studying hardware faults.
88 Techniques for bandwidth-efficient prefetching of linked data structures in hybrid prefetching systems.
86 A low-radix and low-diameter 3D interconnection network design.
86 CAMP: A technique to estimate per-structure power at run-time using a few simple parameters.
84 Blueshift: Designing processors for timing speculation from the ground up.
76 A first-order fine-grained multithreaded throughput model.
73 Bridging the computation gap between programmable processors and hardwired accelerators.
66 Voltage emergency prediction: Using signatures to reduce operating margins.
65 Optimizing communication and capacity in a 3D stacked reconfigurable cache hierarchy.
65 In-Network Snoop Ordering (INSO): Snoopy coherence on unordered interconnects.
63 Hardware-software integrated approaches to defend against software cache-based side channel attacks.
56 MRR: Enabling fully adaptive multicast routing for CMP interconnection networks.
56 Versatile prediction and fast estimation of Architectural Vulnerability Factor from processor performance metrics.
52 iCFP: Tolerating all-level cache misses in in-order processors.
49 Design and implementation of software-managed caches for multicores with local memory.
47 Dacota: Post-silicon validation of the memory subsystem in multi-core designs.
36 An intelligent IT infrastructure for the future.
34 Reconciling specialization and flexibility through compound circuits.
34 Fast complete memory consistency verification.
33 Characterization of Direct Cache Access on multi-core systems and 10GbE.
28 Practical off-chip meta-data for temporal memory streaming.
23 Architectural Contesting.
19 Feedback mechanisms for improving probabilistic memory prefetching.
19 Soft error vulnerability aware process variation mitigation.
16 Criticality-based optimizations for efficient load processing.
3 Lightweight predication support for out of order processors.
2 Opportunities beyond single-core microprocessors.
0 Industrial perspectives panel.
0 Multi-core demands multi-interfaces.

2008

Cited by Paper title
1235 Amdahl’s Law in the multicore era.
617 System level analysis of fast, per-core DVFS using on-chip switching regulators.
336 Gaining insights into multicore cache partitioning: Bridging the gap between simulation and real systems.
315 Regional congestion awareness for load balance in networks-on-chip.
231 CMP network-on-chip overlaid with multi-band RF-interconnect.
211 Cluster-level feedback power control for performance optimization.
138 FlexiTaint: A programmable accelerator for dynamic taint propagation.
132 A comprehensive approach to DRAM power management.
107 C-Oracle: Predictive thermal management for data centers.
103 Uncovering hidden loop level parallelism in sequential applications.
84 Performance and power optimization through data compression in Network-on-Chip architectures.
74 DeCoR: A Delayed Commit and Rollback mechanism for handling inductive noise in processors.
65 An OS-based alternative to full hardware coherence on tiled CMPs.
64 Automated microprocessor stressmark generation.
49 Design and implementation of the Blue Gene/P snoop filter.
47 Thread-safe dynamic binary translation using transactional memory.
42 EXCES: External caching in energy saving storage systems.
38 Runtime validation of memory ordering using constraint graph checking.
34 Supporting highly-decoupled thread-level redundancy for parallel programs.
31 Fundamental performance constraints in horizontal fusion of in-order cores.
27 Single-level integrity and confidentiality protection for distributed shared memory multiprocessors.
26 High-throughput pairwise point interactions in Anton, a specialized machine for molecular dynamics simulation.
26 Power-Efficient DRAM Speculation.
24 Address-branch correlation: A novel locality for long-latency hard-to-predict branches.
23 Incorporating flexibility in Anton, a specialized machine for molecular dynamics simulation.
21 Runahead Threads to improve SMT performance.
20 Roughness of microarchitectural design topologies and its implications for optimization.
18 Prediction of CPU idle-busy activity pattern.
14 PEEP: Exploiting predictability of memory dependences in SMT processors.
12 PaCo: Probability-based path confidence prediction.
10 Serializing instructions in system-intensive workloads: Amdahl’s Law strikes again.
10 Speculative instruction validation for performance-reliability trade-off.
8 Performance-aware speculation control using wrong path usefulness prediction.
5 Fabric convergence implications on systems architecture.
4 Branch-mispredict level parallelism (BLP) for control independence.
0 Intel’s Tera-scale Computing Project: The first five years, the next five years.
0 Compilers and parallel computing systems.

2007

Cited by Paper title
1022 Evaluating MapReduce for Multi-core and Multiprocessor Systems.
373 LogTM-SE: Decoupling Hardware Transactional Memory from Caches.
243 Feedback Directed Prefetching: Improving the Performance and Bandwidth-Efficiency of Hardware Prefetchers.
190 Concurrent Direct Network Access for Virtual Machine Monitors.
159 Thermal Herding: Microarchitecture Techniques for Controlling Hotspots in High-Performance 3D-Integrated Processors.
147 A Scalable, Non-blocking Approach to Transactional Memory.
144 Application-Level Correctness and its Impact on Fault Tolerance.
141 An Adaptive Shared/Private NUCA Cache Partitioning Scheme for Chip Multiprocessors.
137 HARD: Hardware-Assisted Lockset-based Race Detection.
125 A Burst Scheduling Access Reordering Mechanism.
113 Extending Multicore Architectures to Exploit Hybrid Parallelism in Single-thread Applications.
105 Perturbation-based Fault Screening.
102 MemTracker: Efficient and Programmable Support for Memory Access Monitoring and Debugging.
97 Illustrative Design Space Studies with Microarchitectural Regression Models.
79 Fully-Buffered DIMM Memory Architectures: Understanding Mechanisms, Overheads and Scaling.
65 A Memory-Level Parallelism Aware Fetch Policy for SMT Processors.
64 Interactions Between Compression and Prefetching in Chip Multiprocessors.
64 Line Distillation: Increasing Cache Capacity by Filtering Unused Words in Cache Lines.
63 An Adaptive Cache Coherence Protocol Optimized for Producer-Consumer Sharing.
57 Modeling and Managing Thermal Profiles of Rack-mounted Servers with ThermoStat.
41 Colorama: Architectural Support for Data-Centric Synchronization.
40 Accelerating and Adapting Precomputation Threads for Efficient Prefetching.
39 Liquid SIMD: Abstracting SIMD Hardware using Lightweight Dynamic Mapping.
38 A Domain-Specific On-Chip Network Design for Large Scale Cache Systems.
36 Error Detection via Online Checking of Cache Coherence with Token Coherence Signatures.
35 A Low Overhead Fault Tolerant Coherence Protocol for CMP Architectures.
25 Optical Interconnect Opportunities for Future Server Memory Systems.
24 Exploiting Postdominance for Speculative Parallelization.
21 Improving Branch Prediction and Predicated Execution in Out-of-Order Processors.
14 Implications of Device Timing Variability on Full Chip Timing.
4 Interconnect-Centric Computing.
3 Petascale Computing Research Challenges - A Manycore Perspective.

2006

Cited by Paper title
770 LogTM: log-based transactional memory.
266 Dynamic power-performance adaptation of parallel computation on chip multiprocessors.
176 BulletProof: a defect-tolerant CMP switch architecture.
169 CMP design space exploration subject to physical constraints.
164 Construction and use of linear regression models for processor performance analysis.
143 Last level cache (LLC) performance of data mining workloads on a CMP - a case study of parallel bioinformatics workloads.
127 Retention-aware placement in DRAM (RAPID): software methods for quasi-non-volatile DRAM.
125 The common case transactional behavior of multithreaded programs.
112 Phase characterization for power: evaluating control-flow-based and event-counter-based techniques.
91 CORD: cost-effective (and nearly overhead-free) order-recording and data race detection.
86 DMA-aware memory energy management.
86 Exploiting parallelism and structure to accelerate the simulation of chip multi-processors.
86 ReViveI/O: efficient handling of I/O in highly-available rollback-recovery servers.
77 Understanding the performance-temperature interactions in disk I/O of server workloads.
77 High performance file I/O for the Blue Gene/L supercomputer.
49 InfoShield: a security architecture for protecting information usage in memory.
45 An approach for implementing efficient superscalar CISC processors.
33 A decoupled KILO-instruction processor.
31 Reducing resource redundancy for concurrent error detection techniques in high performance microprocessors.
28 Increasing the cache efficiency by eliminating noise.
27 Efficient instruction schedulers for SMT processors.
23 Completely verifying memory consistency of test program executions.
21 Software-hardware cooperative memory disambiguation.
20 Store vectors for scalable memory dependence prediction and scheduling.
17 Probabilistic counter updates for predictor hysteresis and stratification.
12 Chip-multiprocessing and beyond.
5 Speculative synchronization and thread management for fine granularity threads.
2 Industrial Perspectives: Platform Design Challenges with Many cores.
0 Industrial Perspectives: The Next Roadblocks in SOC Evolution: On-Chip Storage Capacity and Off-Chip Bandwidth.
0 Industrial Perspectives: System IO Network Evolution - Closing Requirement Gaps.
0 New architectures for a new biology.

2005

Cited by Paper title
616 Unbounded Transactional Memory.
589 Predicting Inter-Thread Cache Contention on a Chip Multi-Processor Architecture.
499 Power Efficient Processor Architecture and The Cell Processor.
395 The Soft Error Problem: An Architectural Perspective.
198 Chip Multithreading: Opportunities and Challenges.
183 Performance, Energy, and Thermal Considerations for SMT and CMP Architectures.
177 SafeMem: Exploiting ECC-Memory for Detecting Memory Leaks and Memory Corruption During Production Runs.
136 A New Scalable and Cost-Effective Congestion Management Strategy for Lossless Multistage Interconnection Networks.
117 Transition Phase Classification and Prediction.
111 Characterizing and Comparing Prevailing Simulation Techniques.
104 Improving Multiple-CMP Systems Using Token Coherence.
91 A Performance Comparison of DRAM Memory System Optimizations for SMT Processors.
90 Checkpointed Early Load Retirement.
84 A Unified Compressed Memory Hierarchy.
81 Trends in High-Performance Processors.
78 Voltage and Frequency Control With Adaptive Reaction Time in Multiple-Clock-Domain Processors.
75 Distributing the Frontend for Temperature Reduction.
67 SENSS: Security Enhancement to Symmetric Shared Memory Multiprocessors.
64 On the Limits of Leakage Power Reduction in Caches.
61 Stretching the Limits of Clock-Gating Efficiency in Server-Class Processors.
61 A Small, Fast and Low-Power Register File by Bit-Partitioning.
59 Effective Instruction Prefetching in Chip Multiprocessors for Modern Commercial Applications.
54 Enterprise IT Trends and Implications for Architecture Research.
53 Microarchitectural Wire Management for Performance and Power in Partitioned Architectures.
53 An Efficient Programmable 10 Gigabit Ethernet Network Interface Card.
47 Scatter-Add in Data Parallel Architectures.
43 Heat Stroke: Power-Density-Based Denial of Service in SMT.
34 Exploring the Design Space of Power-Aware Opto-Electronic Networked Systems.
28 Multithreaded Value Prediction.
27 Software Directed Issue Queue Power Reduction.
26 Low-Overhead Interactive Debugging via Dynamic Instrumentation with DISE.
26 Accurate Energy Dissipation and Thermal Modeling for Nanometer-Scale Buses.
11 Tapping ZettaRAM™ for Low-Power Memory Systems.
11 Using Virtual Load/Store Queues (VLSQs) to Reduce the Negative Effects of Reordered Memory Instructions.
0 The Future of Computer Architecture Research: An Industrial Perspective.