ISCA

Cited by Paper title Year
1553 Power provisioning for a warehouse-sized computer. 2007
1203 Dark silicon and the end of multicore scaling. 2011
937 Scalable high performance main memory system using phase-change memory technology. 2009
875 Architecting phase change memory as a scalable DRAM alternative. 2009
756 Debunking the 100X GPU vs. CPU myth: an evaluation of throughput computing on CPU and GPU. 2010
644 A durable and energy efficient main memory using phase change memory technology. 2009
625 Corona: System Implications of Emerging Nanophotonic Technology. 2008
610 Continuous Optimization. 2005
588 3D-Stacked Memory Architectures for Multi-core Processors. 2008
547 Adaptive insertion policies for high performance caching. 2007
539 Techniques for Multicore Thermal Management: Classification and New Exploration. 2006
532 An analytical model for a GPU architecture with memory-level and thread-level parallelism awareness. 2009
511 Interconnections in Multi-Core Architectures: Understanding Mechanisms, Overheads and Scaling. 2005
497 Virtualizing Transactional Memory. 2005
477 Cooperative Caching for Chip Multiprocessors. 2006
451 Anton, a special-purpose machine for molecular dynamics simulation. 2007
448 Parallelism-Aware Batch Scheduling: Enhancing both Performance and Fairness of Shared DRAM Systems. 2008
427 Design and Management of 3D Chip Multiprocessors Using Network-in-Memory. 2006
413 High performance cache replacement using re-reference interval prediction (RRIP). 2010
411 Victim Replication: Maximizing Capacity while Hiding Wire Delay in Tiled Chip Multiprocessors. 2005
389 An integrated GPU power and performance model. 2010
380 Reactive NUCA: near-optimal block placement and replication in distributed caches. 2009
376 Ensemble-level Power Management for Dense Blade Servers. 2006
368 Express virtual channels: towards the ideal interconnection fabric. 2007
367 Technology-Driven, Highly-Scalable Dragonfly Topology. 2008
365 An effective hybrid transactional memory system with strong isolation guarantees. 2007
363 Flattened butterfly: a cost-efficient topology for high-radix networks. 2007
361 Energy proportional datacenter networks. 2010
353 A reconfigurable fabric for accelerating large-scale datacenter services. 2014
341 BugNet: Continuously Recording Program Execution for Deterministic Replay Debugging. 2005
341 Optimizing Replication, Communication, and Capacity Allocation in CMPs. 2005
337 Firefly: illuminating future network-on-chip with nanophotonics. 2009
334 A High Throughput String Matching Architecture for Intrusion Detection and Prevention. 2005
327 Bulk Disambiguation of Speculative Threads in Multiprocessors. 2006
318 Understanding sources of inefficiency in general-purpose chips. 2010
316 A case for bufferless routing in on-chip networks. 2009
306 Hybrid cache architecture with disparate memory technologies. 2009
305 The Impact of Performance Asymmetry in Emerging Multicore Architectures. 2005
295 Power management of online data-intensive services. 2011
294 Core fusion: accommodating software diversity in chip multiprocessors. 2007
292 Variation-Aware Application Scheduling and Power Management for Chip Multiprocessors. 2008
290 Improving NAND Flash Based Disk Caches. 2008
289 A Scalable Architecture For High-Throughput Regular-Expression Pattern Matching. 2006
288 Self-Optimizing Memory Controllers: A Reinforcement Learning Approach. 2008
285 Raksha: a flexible information flow architecture for software security. 2007
272 GPUWattch: enabling energy optimizations in GPGPUs. 2013
267 A Case for MLP-Aware Cache Replacement. 2006
267 PIPP: promotion/insertion pseudo-partitioning of multi-core shared caches. 2009
265 A novel dimensionally-decomposed router for on-chip communication in 3D architectures. 2007
263 Mitigating Amdahl’s Law through EPI Throttling. 2005
263 New cache designs for thwarting software cache-based side channel attacks. 2007
258 Performance pathologies in hardware transactional memory. 2007
254 NoHype: virtualized cloud infrastructure without the virtualization. 2010
253 SODA: A Low-power Architecture For Software Radio. 2006
238 Hardware support for WCET analysis of hard real-time multicore systems. 2009
236 Scaling the bandwidth wall: challenges in and avenues for CMP scaling. 2009
236 RAIDR: Retention-aware intelligent DRAM refresh. 2012
232 Microarchitecture of a High-Radix Router. 2005
232 Use ECP, not ECC, for hard failures in resistive memories. 2010
230 Design and Implementation of the AEGIS Single-Chip Secure Processor Using Physical Random Functions. 2005
229 A Gracefully Degrading and Energy-Efficient Modular Router Architecture for On-Chip Networks. 2006
227 Exploiting Structural Duplication for Lifetime Reliability Enhancement. 2005
225 Carbon: architectural support for fine-grained parallelism on chip multiprocessors. 2007
225 Trading off Cache Capacity for Reliability to Enable Low Voltage Operation. 2008
225 Thread motion: fine-grained power management for multi-core systems. 2009
223 BulkSC: bulk enforcement of sequential consistency. 2007
222 Virtual Circuit Tree Multicasting: A Case for On-Chip Hardware Multicast Support. 2008
221 MIRA: A Multi-layered On-Chip Interconnect Router Architecture. 2008
220 The V-Way Cache: Demand Based Associativity via Global Replacement. 2005
217 Architectural Semantics for Practical Transactional Memory. 2006
217 DeLorean: Recording and Deterministically Replaying Shared-Memory Multiprocessor Execution Efficiently. 2008
217 A Comprehensive Memory Modeling Tool and Its Application to the Design and Analysis of Future Memory Hierarchies. 2008
216 Understanding and Designing New Server Architectures for Emerging Warehouse-Computing Environments. 2008
215 Computing Architectural Vulnerability Factors for Address-Based Structures. 2005
211 Architecture for Protecting Critical Secrets in Microprocessors. 2005
211 Rethinking DRAM design and organization for energy-constrained multi-cores. 2010
208 The BlackWidow High-Radix Clos Network. 2006
207 Virtual hierarchies to support server consolidation. 2007
205 Scheduling heterogeneous multi-cores through performance impact estimation (PIE). 2012
195 Near-Optimal Worst-Case Throughput Routing for Two-Dimensional Mesh Networks. 2005
195 An Ultra Low Power System Architecture for Sensor Network Applications. 2005
195 Relax: an architectural framework for software recovery of hardware faults. 2010
192 Configurable isolation: building high availability systems with commodity multi-core processors. 2007
192 Virtual private caches. 2007
191 RegionScout: Exploiting Coarse Grain Sharing in Snoop-Based Coherence. 2005
190 The performance of PC solid-state disks (SSDs) as a function of bandwidth, concurrency, device architecture, and system organization. 2009
189 Security refresh: prevent malicious wear-out and increase durability for phase-change memory with dynamically randomized address mapping. 2010
186 Dynamic warp subdivision for integrated branch and memory divergence tolerance. 2010
179 Rerun: Exploiting Episodes for Lightweight Memory Race Recording. 2008
178 An Architecture Framework for Transparent Instruction Set Customization in Embedded Processors. 2005
175 Energy-efficient mechanisms for managing thread context in throughput processors. 2011
174 Temperature-constrained power control for chip multiprocessors with online model estimation. 2009
173 Reducing cache power with low-cost, multi-bit error-correcting codes. 2010
172 Web search using mobile cores: quantifying and mitigating the price of efficiency. 2010
172 The impact of memory subsystem resource sharing on datacenter applications. 2011
170 Thread criticality predictors for dynamic performance, power, and resource management in chip multiprocessors. 2009
169 Phastlane: a rapid transit optical routing network. 2009
165 Spatial Memory Streaming. 2006
162 Making the fast case common and the uncommon case simple in unbounded transactional memory. 2007
160 Flexible Decoupled Transactional Memory Support. 2008
160 Benefits and limitations of tapping into stored energy for datacenters. 2011
159 Rigel: an architecture and scalable programming interface for a 1000-core accelerator. 2009
157 Analysis of redundancy and application balance in the SPEC CPU2006 benchmark suite. 2007
157 Vantage: scalable and efficient fine-grain cache partitioning. 2011
154 Disaggregated memory for expansion and sharing in blade servers. 2009
153 ReCycle: pipeline adaptation to tolerate process variation. 2007
150 Design and Evaluation of Hybrid Fault-Detection Systems. 2005
150 Globally-Synchronized Frames for Guaranteed Quality-of-Service in On-Chip Networks. 2008
148 Opportunistic Transient-Fault Detection. 2005
148 Direct Cache Access for High Bandwidth Network I/O. 2005
145 Staged memory scheduling: Achieving high performance and scalability in heterogeneous systems. 2012
144 Towards energy-proportional datacenter memory with mobile DRAM. 2012
143 Morphable memory system: a robust architecture for exploiting multi-level phase change memories. 2010
143 ZSim: fast and accurate microarchitectural simulation of thousand-core systems. 2013
141 Resistive computation: avoiding the power wall with low-leakage, STT-MRAM based computing. 2010
140 A case for exploiting subarray-level parallelism (SALP) in DRAM. 2012
139 Improving Cost, Performance, and Security of Memory Encryption and Authentication. 2006
139 Aérgia: exploiting packet latency slack in on-chip networks. 2010
138 An integrated hardware-software approach to flexible transactional memory. 2007
137 Limiting the power consumption of main memory. 2007
137 Achieving predictable performance through better memory controller placement in many-core CMPs. 2009
137 A case for an interleaving constrained shared-memory multi-processor. 2009
136 Scale-out processors. 2012
134 TokenTM: Efficient Execution of Large Transactions with Hardware Transactional Memory. 2008
134 Kilo-NOC: a heterogeneous network-on-chip architecture for scalability and service guarantees. 2011
132 Increasing the effectiveness of directory caches by deactivating coherence for private memory blocks. 2011
131 Interconnect-Aware Coherence Protocols for Chip Multiprocessors. 2006
131 Flipping bits in memory without accessing them: An experimental study of DRAM disturbance errors. 2014
130 Flexible Hardware Acceleration for Instruction-Grain Program Monitoring. 2008
127 Comparing memory systems for chip multiprocessors. 2007
125 A Robust Main-Memory Compression Scheme. 2005
125 Dynamic prediction of architectural vulnerability from microarchitectural state. 2007
124 Improving Multiprocessor Performance with Coarse-Grain Coherence Tracking. 2005
124 Architectural core salvaging in a multi-core processor for hard-error tolerance. 2009
124 A dynamically configurable coprocessor for convolutional neural networks. 2010
124 Energy-performance tradeoffs in processor architecture and circuit design: a marginal cost analysis. 2010
123 Bubble-flux: precise online QoS management for increased utilization in warehouse scale computers. 2013
122 Temporal Streaming of Shared Memory. 2005
121 Thin servers with smart pipes: designing SoC accelerators for memcached. 2013
120 Learning-Based SMT Processor Resource Distribution via Hill-Climbing. 2006
120 TRAP-Array: A Disk Array Architecture Providing Timely Recovery to Any Point-in-time. 2006
120 FabScalar: composing synthesizable RTL designs of arbitrary cores within a canonical superscalar template. 2011
119 An experimental study of data retention behavior in modern DRAM devices: implications for retention time profiling mechanisms. 2013
118 Analysis of the O-GEometric History Length Branch Predictor. 2005
118 Atom-Aid: Detecting and Surviving Atomicity Violations. 2008
118 Managing distributed UPS energy for effective power capping in data centers. 2012
114 DBAR: an efficient routing algorithm to support multiple concurrent applications in networks-on-chip. 2011
113 Using Hardware Memory Protection to Build a High-Performance, Strongly-Atomic Hybrid Transactional Memory. 2008
113 VEAL: Virtualized Execution Accelerator for Loops. 2008
111 Energy-efficient cache design using variable-strength error-correcting codes. 2011
110 Translation caching: skip, don’t walk (the page table). 2010
109 Energy Optimization of Subthreshold-Voltage Sensor Network Processors. 2005
109 Interconnect design considerations for large NUCA caches. 2007
107 Efficient virtual memory for big memory servers. 2013
106 Memory mapped ECC: low-cost error protection for last level caches. 2009
106 PreSET: Improving performance of phase change memories by exploiting asymmetry in write times. 2012
105 Convolution engine: balancing efficiency & flexibility in specialized computing. 2013
103 Examining ACE analysis reliability estimates using fault-injection. 2007
101 Adaptive Mechanisms and Policies for Managing Cache Hierarchies in Chip Multiprocessors. 2005
101 Towards energy proportionality for large-scale latency-critical workloads. 2014
100 SigRace: signature-based data race detection. 2009
100 Re-architecting DRAM memory systems with monolithically integrated silicon photonics. 2010
99 High Efficiency Counter Mode Security Architecture via Prediction and Precomputation. 2005
99 Mechanisms for store-wait-free multiprocessors. 2007
99 The impact of management operations on the virtualized datacenter. 2010
99 SieveStore: a highly-selective, ensemble-level disk cache for cost-performance. 2010
98 A Tree Based Router Search Engine Architecture with Single Port Memories. 2005
98 Scalable power control for many-core architectures running multi-threaded applications. 2011
98 Prefetch-aware shared resource management for multi-core systems. 2011
98 A scalable processing-in-memory accelerator for parallel graph processing. 2015
97 Exploring the tradeoffs between programmability and efficiency in data-parallel accelerators. 2011
96 MetaTM/TxLinux: transactional memory for an operating system. 2007
96 Spatio-temporal memory streaming. 2009
96 Orchestrated scheduling and prefetching for GPGPUs. 2013
95 Piecewise Linear Branch Prediction. 2005
95 Robust architectural support for transactional memory in the Power architecture. 2013
95 General-purpose code acceleration with limited-precision analog computation. 2014
94 AnySP: anytime anywhere anyway signal processing. 2009
94 Modeling critical sections in Amdahl’s law and its implications for multicore design. 2010
94 Die-stacked DRAM caches for servers: hit ratio, latency, or bandwidth? have it all with footprint cache. 2013
91 A Proactive Wearout Recovery Approach for Exploiting Microarchitectural Redundancy to Extend Cache SRAM Lifetime. 2008
90 Synchronization state buffer: supporting efficient fine-grain synchronization on many-core architectures. 2007
90 InvisiFence: performance-transparent memory ordering in conventional multiprocessors. 2009
89 Conflict exceptions: simplifying concurrent language semantics with precise hardware exceptions for data-races. 2010
88 Rotary router: an efficient architecture for CMP interconnection networks. 2007
88 The virtual write queue: coordinating DRAM and last-level cache policies. 2010
88 A case for heterogeneous on-chip interconnects for CMPs. 2011
88 Memory persistency. 2014
86 Silicon-photonic network architectures for scalable, power-efficient multi-chip systems. 2010
84 Evolution of thread-level parallelism in desktop applications. 2010
84 Catnap: energy proportional multiple network-on-chip. 2013
83 On the feasibility of online malware detection with performance counters. 2013
82 Disk Drive Roadmap from the Thermal Perspective: A Case for Dynamic Thermal Management. 2005
82 Hardware atomicity for reliable software speculation. 2007
82 A defect-tolerant accelerator for emerging high-performance applications. 2012
80 Store Vulnerability Window (SVW): Re-Execution Filtering for Enhanced Load Optimization. 2005
80 iSwitch: Coordinating and optimizing renewable energy powered server clusters. 2012
80 EIE: Efficient Inference Engine on Compressed Deep Neural Network. 2016
79 An abacus turn model for time/space-efficient reconfigurable routing. 2011
78 ReVIVaL: A Variation-Tolerant Architecture Using Voltage Interpolation and Variable Latency. 2008
77 An intra-chip free-space optical interconnect. 2010
76 Rescue: A Microarchitecture for Testability and Defect Tolerance. 2005
76 Power model validation through thermal measurements. 2007
76 ShiDianNao: shifting vision processing closer to the sensor. 2015
75 Online Estimation of Architectural Vulnerability Factor for Soft Errors. 2008
75 Can traditional programming bridge the Ninja performance gap for parallel computing applications? 2012
74 Application-aware deadlock-free oblivious routing. 2009
74 Architecting on-chip interconnects for stacked 3D STT-RAM caches in CMPs. 2011
74 Bypass and insertion algorithms for exclusive last-level caches. 2011
74 Simultaneous branch and warp interweaving for sustained GPU performance. 2012
74 Aladdin: A pre-RTL, power-performance accelerator simulator enabling large design space exploration of customized architectures. 2014
73 Thread tailor: dynamically weaving threads together for efficient, adaptive parallel applications. 2010
73 TimeWarp: Rethinking timekeeping and performance monitoring mechanisms to mitigate side-channel attacks. 2012
73 The Yin and Yang of power and performance for asymmetric hardware and managed software. 2012
72 A 64-bit stream processor architecture for scientific applications. 2007
71 iDEAL: Inter-router Dual-Function Energy and Area-Efficient Links for Network-on-Chip (NoC) Architectures. 2008
70 Scalable Load and Store Processing in Latency Tolerant Processors. 2005
70 Chisel: A Storage-efficient, Collision-free Hash-based Network Processing Architecture. 2006
70 Decoupled DIMM: building high-bandwidth memory system using low-speed DRAM devices. 2009
69 Techniques for Efficient Processing in Runahead Execution Engines. 2005
69 Automated design of application specific superscalar processors: an analytical approach. 2007
69 Polymorphic On-Chip Networks. 2008
69 Crafting a usable microkernel, processor, and I/O system with strict and provable information flow security. 2011
68 Program Demultiplexing: Data-flow based Speculative Parallelization of Methods in Sequential Programs. 2006
68 Elastic cooperative caching: an autonomous dynamically adaptive memory hierarchy for chip multiprocessors. 2010
66 Indirect adaptive routing on large scale interconnection networks. 2009
66 Adaptive granularity memory systems: a tradeoff between storage efficiency and throughput. 2011
66 The role of optics in future high radix switch design. 2011
66 Navigating big data with high-throughput, energy-efficient data partitioning. 2013
65 A case for FAME: FPGA architecture model execution. 2010
65 Combining memory and a controller with photonics through 3D-stacking to enable scalable and energy-efficient systems. 2011
65 End-to-end sequential consistency. 2012
64 An Integrated Memory Array Processor Architecture for Embedded Image Recognition Systems. 2005
64 Mechanisms for bounding vulnerabilities of processor structures. 2007
64 A case for random shortcut topologies for HPC interconnects. 2012
64 Whare-map: heterogeneity in “homogeneous” warehouse-scale computers. 2013
64 Design space exploration and optimization of path oblivious RAM in secure processors. 2013
62 Balanced Cache: Reducing Conflict Misses of Direct-Mapped Caches. 2006
62 Internet-scale service infrastructure efficiency. 2009
61 LOT-ECC: Localized and tiered reliability mechanisms for commodity memory systems. 2012
60 An Integrated Framework for Dependable and Revivable Architectures Using Multicore Processors. 2006
60 Simultaneous speculative threading: a novel pipeline architecture implemented in Sun’s Rock processor. 2009
60 ColorSafe: architectural support for debugging and dynamically avoiding multi-variable atomicity violations. 2010
60 SRAM-DRAM hybrid memory with applications to efficient register files in fine-grained multi-threading. 2011
60 ArchShield: architectural framework for assisting DRAM scaling by tolerating high error rates. 2013
60 A hardware evaluation of cache partitioning to improve utilization and energy-efficiency while preserving responsiveness. 2013
59 Deconstructing Commodity Storage Clusters. 2005
59 Flexible Snooping: Adaptive Forwarding and Filtering of Snoops in Embedded-Ring Multiprocessors. 2006
59 Rapid identification of architectural bottlenecks via precise event counting. 2011
59 Probabilistic Shared Cache Management (PriSM). 2012
58 Branch regulation: Low-overhead protection from code reuse attacks. 2012
58 Triggered instructions: a control paradigm for spatially-programmed architectures. 2013
58 The CHERI capability model: Revisiting RISC in an age of risk. 2014
57 Thermal modeling and management of DRAM memory systems. 2007
57 Heracles: improving resource efficiency at scale. 2015
57 PIM-enabled instructions: a low-overhead, locality-aware processing-in-memory architecture. 2015
56 Side-channel vulnerability factor: A metric for measuring information leakage. 2012
55 Cohesion: a hybrid memory model for accelerators. 2010
54 Memory Model = Instruction Reordering + Store Atomicity. 2006
54 LINQits: big data on little clients. 2013
53 An Evaluation Framework and Instruction Set Architecture for Ion-Trap Based Quantum Micro-Architectures. 2005
53 Profiling a warehouse-scale computer. 2015
52 WiDGET: Wisconsin decoupled grid execution tiles. 2010
52 SpecTLB: a mechanism for speculative address translation. 2011
52 Reducing memory reference energy with opportunistic virtual caching. 2012
52 SurfNoC: a low latency and provably non-interfering approach to secure networks-on-chip. 2013
51 Quantum Memory Hierarchies: Efficient Designs to Match Available Parallelism in Quantum Computing. 2006
51 Sampling + DMR: practical and low-overhead permanent fault detection. 2011
51 Tri-level-cell phase change memory: toward an efficient and reliable memory system. 2013
50 Reducing memory access latency with asymmetric DRAM bank organizations. 2013
50 Enabling preemptive multiprogramming on GPUs. 2014
49 CAPRI: Prediction of compaction-adequacy for handling control-divergence in GPGPU architectures. 2012
49 Utility-based acceleration of multithreaded applications on asymmetric CMPs. 2013
48 Physical simulation for animation and visual effects: parallelization and characterization for chip multiprocessors. 2007
48 Continuous real-world inputs can open up alternative accelerator designs. 2013
47 ParallAX: an architecture for real-time physics. 2007
47 i-NVMM: a secure non-volatile main memory system with incremental encryption. 2011
47 The dynamic granularity memory system. 2012
46 RENO - A Rename-Based Instruction Optimizer. 2005
46 Stream chaining: exploiting multiple levels of correlation in data prefetching. 2009
46 Physically Addressed Queueing (PAQ): Improving parallelism in Solid State Disks. 2012
45 Late-binding: enabling unordered load-store queues. 2007
45 Learning and Leveraging the Relationship between Architecture-Level Measurements and Individual User Satisfaction. 2008
45 iGPU: Exception support and speculative execution on GPUs. 2012
44 Store Buffer Design in First-Level Multibanked Data Caches. 2005
44 Multiple Instruction Stream Processor. 2006
44 Multi-execution: multicore caching for data-similar executions. 2009
43 VPC prediction: reducing the cost of indirect branches via hardware-based dynamic devirtualization. 2007
42 Improving Program Efficiency by Packing Instructions into Registers. 2005
42 Area-Performance Trade-offs in Tiled Dataflow Architectures. 2006
42 Using hardware vulnerability factors to enhance AVF analysis. 2010
42 RADISH: Always-on sound and complete race detection in software and hardware. 2012
42 Understanding and mitigating refresh overheads in high-density DDR4 DRAM systems. 2013
41 Performance and power of cache-based reconfigurable computing. 2009
41 A new perspective for efficient virtual-cache coherence. 2013
41 Flicker: a dynamically adaptive architecture for power limited multicore systems. 2013
41 An energy-efficient and scalable eDRAM-based register file architecture for GPGPU. 2013
40 Achieving Out-of-Order Performance with Almost In-Order Complexity. 2008
40 A fault tolerant, area efficient architecture for Shor’s factoring algorithm. 2009
40 Dynamic performance tuning for speculative threads. 2009
40 BOOM: Enabling mobile memory based low-power server DIMMs. 2012
39 Reducing Startup Time in Co-Designed Virtual Machines. 2006
39 Matrix scheduler reloaded. 2007
39 From Speculation to Security: Practical and Efficient Information Flow Tracking Using Speculative Hardware. 2008
39 Forwardflow: a scalable core for power-constrained CMPs. 2010
39 RETCON: transactional repair without replay. 2010
39 CPPC: correctable parity protected cache. 2011
39 AC-DIMM: associative computing with STT-MRAM. 2013
39 SCORPIO: A 36-core research chip demonstrating snoopy coherence on a scalable mesh NoC with in-network ordering. 2014
38 Atomic Vector Operations on Chip Multiprocessors. 2008
38 LReplay: a pending period based deterministic replay scheme. 2010
38 Automatic abstraction and fault tolerance in cortical microarchitectures. 2011
38 Criticality stacks: identifying critical threads in parallel programs using synchronization behavior. 2013
37 Software-Controlled Priority Characterization of POWER5 Processor. 2008
37 SC2: A statistical compression cache scheme. 2014
36 Transparent control independence (TCI). 2007
36 ECMon: exposing cache events for monitoring. 2009
36 TLSync: support for multiple fast barriers using on-chip transmission lines. 2011
36 Fighting fire with fire: modeling the datacenter-scale effects of targeted superlattice thermal management. 2011
36 Buffer-on-board memory systems. 2012
36 WebCore: Architectural support for mobile Web browsing. 2014
36 SynFull: Synthetic traffic models capturing cache coherent behaviour. 2014
35 Dynamic Verification of Sequential Consistency. 2005
35 Running a Quantum Circuit at the Speed of Data. 2008
35 Watchdog: Hardware for safe and secure manual memory management and full memory safety. 2012
35 Improving memory scheduling via processor-side load criticality information. 2013
35 Harnessing ISA diversity: Design of a heterogeneous-ISA chip multiprocessor. 2014
34 Virtualizing performance asymmetric multi-core systems. 2011
33 Tolerating Dependences Between Large Speculative Threads Via Sub-Threads. 2006
33 Intra-disk Parallelism: An Idea Whose Time Has Come. 2008
33 Demand-driven software race detection using hardware performance counters. 2011
33 Exploring memory consistency for massively-threaded throughput-oriented processors. 2013
33 STAG: Spintronic-Tape Architecture for GPGPU cache hierarchies. 2014
33 Half-DRAM: A high-bandwidth and low-power DRAM architecture from the rethinking of fine-grained activation. 2014
32 Tolerating process variations in nanophotonic on-chip networks. 2012
32 FLEXclusion: Balancing cache capacity and on-chip bandwidth via Flexible Exclusion. 2012
32 The locality-aware adaptive cache coherence protocol. 2013
32 Resilient die-stacked DRAM caches. 2013
32 Data reorganization in memory using 3D-stacked DRAM. 2015
32 ISAAC: A Convolutional Neural Network Accelerator with In-Situ Analog Arithmetic in Crossbars. 2016
32 Minerva: Enabling Low-Power, Highly-Accurate Deep Neural Network Accelerators. 2016
31 Data marshaling for multi-core architectures. 2010
31 A case for globally shared-medium on-chip interconnect. 2011
31 Zombie memory: extending memory lifetime by reviving dead blocks. 2013
31 Rumba: an online quality management system for approximate computing. 2015
31 PRIME: A Novel Processing-in-Memory Architecture for Neural Network Computation in ReRAM-Based Main Memory. 2016
30 Conditional Memory Ordering. 2006
30 Interconnection Networks for Scalable Quantum Computers. 2006
30 Cooperative boosting: needy versus greedy power management. 2013
29 Aquacore: a programmable architecture for microfluidics. 2007
29 Boosting single-thread performance in multi-core systems through fine-grain multi-threading. 2009
29 Timetraveler: exploiting acyclic races for optimizing memory race recording. 2010
29 Harmony: Collection and analysis of parallel block vectors. 2012
29 PARDIS: A programmable memory controller for the DDRx interfacing standards. 2012
29 Virtualizing power distribution in datacenters. 2013
29 The Dirty-Block Index. 2014
29 Architecting to achieve a billion requests per second throughput on a single key-value store server platform. 2015
27 Revisiting hardware-assisted page walks for virtualized systems. 2012
27 A first-order mechanistic model for architectural vulnerability factor. 2012
27 A micro-architectural analysis of switched photonic multi-chip interconnects. 2012
27 Agile, efficient virtualization power management with low-latency server power states. 2013
27 Redundant memory mappings for fast access to large memories. 2015
27 DjiNN and Tonic: DNN as a service and its implications for future warehouse scale computers. 2015
26 DNA-based molecular architecture with spatially localized components. 2013
26 Unifying on-chip and inter-node switching within the Anton 2 network. 2014
26 BlueDBM: an appliance for big data analytics. 2015
25 Ginger: control independence using tag rewriting. 2007
25 Flexible reference-counting-based hardware acceleration for garbage collection. 2009
25 A memory system design framework: creating smart memories. 2009
25 Rebound: scalable checkpointing for coherent shared memory. 2011
25 VRSync: Characterizing and eliminating synchronization-induced voltage emergencies in many-core processors. 2012
25 Protozoa: adaptive granularity cache coherence. 2013
25 QuickSAN: a storage area network for fast, distributed, solid state disks. 2013
25 Architecture implications of pads as a scarce resource. 2014
24 Slackened Memory Dependence Enforcement: Combining Opportunistic Forwarding with Decoupled Verification. 2006
24 Boosting mobile GPU performance with a decoupled access/execute fragment processor. 2012
24 Studying multicore processor scaling via reuse distance analysis. 2013
24 Quantitative comparison of hardware transactional memory for Blue Gene/Q, zEnterprise EC12, Intel Core, and POWER8. 2015
24 Warped-compression: enabling power efficient GPUs through register compression. 2015
24 Eyeriss: A Spatial Architecture for Energy-Efficient Dataflow for Convolutional Neural Networks. 2016
23 Improving writeback efficiency with decoupled last-write prediction. 2012
23 Lane decoupling for improving the timing-error resiliency of wide-SIMD architectures. 2012
23 SIMD divergence optimization through intra-warp compaction. 2013
23 Maximizing SIMD resource utilization in GPGPUs with SIMD lane permutation. 2013
22 Bit mapping for balanced PCM cell programming. 2013
22 Dynamic reduction of voltage margins by leveraging on-chip ECC in Itanium II processors. 2013
22 Eliminating redundant fragment shader executions on a mobile GPU via hardware memoization. 2014
22 A case for core-assisted bottleneck acceleration in GPUs: enabling flexible data compression with assist warps. 2015
21 Distributed Arithmetic on a Quantum Multicomputer. 2006
21 Dynamic MIPS rate stabilization in out-of-order processors. 2009
21 Moguls: a model to explore the memory hierarchy for bandwidth improvements. 2011
21 WeeFence: toward making fences free in TSO. 2013
21 Going vertical in memory management: Handling multiplicity by multi-policy. 2014
21 SleepScale: Runtime joint speed scaling and sleep states management for power efficient data centers. 2014
21 Dynamic thread block launch: a lightweight execution mechanism to support irregular applications on GPUs. 2015
21 CAWA: coordinated warp scheduling and cache prioritization for critical warp acceleration of GPGPU workloads. 2015
21 A fully associative, tagless DRAM cache. 2015
20 Leveraging the core-level complementary effects of PVT variations to reduce timing emergencies in multi-core processors. 2010
20 Flexible auto-refresh: enabling scalable and energy-efficient DRAM refresh reductions. 2015
20 BEAR: techniques for mitigating bandwidth bloat in gigascale DRAM caches. 2015
19 End-to-end register data-flow continuous self-test. 2009
19 OUTRIDER: efficient memory latency tolerance with decoupled strands. 2011
19 Inspection resistant memory: Architectural support for security from physical examination. 2012
19 QuickRec: prototyping an Intel architecture extension for record and replay of multithreaded programs. 2013
19 Single-graph multiple flows: Energy efficient design alternative for GPGPUs. 2014
19 Exploring the potential of heterogeneous von Neumann/dataflow execution models. 2015
18 HELIX-RC: An architecture-compiler co-design for automatic parallelization of irregular programs. 2014
18 Stash: have your scratchpad and cache it too. 2015
18 Cnvlutin: Ineffectual-Neuron-Free Deep Neural Network Computing. 2016
17 The Future of Virtualization Technology. 2006
17 Necromancer: enhancing system throughput by animating dead cores. 2010
17 Optimizing virtual machine consolidation performance on NUMA server architecture for cloud workloads. 2014
17 HIOS: A host interface I/O scheduler for Solid State Disks. 2014
17 Transparent Offloading and Mapping (TOM): Enabling Programmer-Transparent Near-Data Processing in GPU Systems. 2016
16 CPU transparent protection of OS kernel and hypervisor integrity with programmable DRAM. 2013
16 Towards sustainable in-situ server systems in the big data era. 2015
16 HEB: deploying and managing hybrid energy buffers for improving datacenter efficiency and economy. 2015
15 Counting Dependence Predictors. 2008
15 Microcoded Architectures for Ion-Tap Quantum Computers. 2008
15 Sentry: light-weight auxiliary memory access control. 2010
15 CODOMs: Protecting software with Code-centric memory Domains. 2014
15 Real-world design and evaluation of compiler-managed GPU redundant multithreading. 2014
15 EOLE: Paving the way for an effective implementation of value prediction. 2014
15 Multiple clone row DRAM: a low latency and area optimized DRAM. 2015
14 Ten ways to waste a parallel computer. 2009
14 Viper: Virtual pipelines for enhanced reliability. 2012
14 Enhancing effective throughput for transmission line-based bus. 2012
14 STREX: boosting instruction cache reuse in OLTP workloads through stratified transaction execution. 2013
14 Secure I/O device sharing among virtual machines on multiple hosts. 2013
14 Page overlays: an enhanced virtual memory framework to enable fine-grained memory management. 2015
14 Hi-fi playback: tolerating position errors in shift operations of racetrack memory. 2015
13 Performance and security lessons learned from virtualizing the Alpha processor. 2007
13 The rebirth of neural networks. 2010
13 CRIB: consolidated rename, issue, and bypass. 2011
13 ArchRanker: A ranking approach to design space exploration. 2014
13 Fine-grain task aggregation and coordination on GPUs. 2014
13 GangES: Gang error simulation for hardware resiliency evaluation. 2014
13 Manycore network interfaces for in-memory rack-scale computing. 2015
13 Callback: efficient synchronization without invalidation with a directory just for spin-waiting. 2015
13 ArMOR: defending against memory consistency model mismatches in heterogeneous architectures. 2015
13 Accelerating Dependent Cache Misses with an Enhanced Memory Controller. 2016
13 Neurocube: A Programmable Digital Neuromorphic Architecture with High-Density 3D Memory. 2016
12 Fusion: design tradeoffs in coherent cache hierarchies for accelerators. 2015
12 SLIP: reducing wire energy in the memory hierarchy. 2015
12 Cambricon: An Instruction Set Architecture for Neural Networks. 2016
11 Tailoring quantum architectures to implementation style: a quantum computer for mobile and persistent qubits. 2007
11 End-to-end performance forecasting: finding bottlenecks before they happen. 2009
11 Microarchitectural mechanisms to exploit value structure in SIMT architectures. 2013
11 OmniOrder: Directory-based conflict serialization of transactions. 2014
11 Harmonia: balancing compute and memory power in high-performance GPUs. 2015
11 RedEye: Analog ConvNet Image Sensor Architecture for Continuous Mobile Vision. 2016
10 Architectural implications of brick and mortar silicon manufacturing. 2007
10 Decoupled store completion/silent deterministic replay: enabling scalable data memory for CPR/CFP processors. 2009
10 Improving virtualization in the presence of software managed translation lookaside buffers. 2013
10 Increasing off-chip bandwidth in multi-core processors with switchable pins. 2014
10 Race Logic: A hardware acceleration for dynamic programming algorithms. 2014
10 Flexible software profiling of GPU architectures. 2015
9 Increased Scalability and Power Efficiency by Using Multiple Speed Pipelines. 2005
9 A Two-Level Load/Store Queue Based on Execution Locality. 2008
9 Replay debugging: Leveraging record and replay for program debugging. 2014
9 Navigating the cache hierarchy with a single lookup. 2014
9 An examination of the architecture and system-level tradeoffs of employing steep slope devices in 3D CMPs. 2014
9 Avoiding core’s DUE & SDC via acoustic wave detectors and tailored error containment and recovery. 2014
9 Thermal time shifting: leveraging phase change materials to reduce cooling costs in warehouse-scale computers. 2015
9 Probable cause: the deanonymizing effects of approximate DRAM. 2015
9 COP: to compress and protect main memory. 2015
8 Setting an error detection infrastructure with low cost acoustic wave detectors. 2012
8 A low power and reliable charge pump design for Phase Change Memories. 2014
8 Improving the energy efficiency of Big Cores. 2014
8 Row-buffer decoupling: A case for low-latency DRAM microarchitecture. 2014
8 Reducing access latency of MLC PCMs through line striping. 2014
8 DynaSpAM: dynamic spatial architecture mapping using out of order instruction schedules. 2015
8 PrORAM: dynamic prefetcher for oblivious RAM. 2015
8 Computer performance microscopy with Shim. 2015
8 CloudMonatt: an architecture for security health monitoring and attestation of virtual machines in cloud computing. 2015
8 LaPerm: Locality Aware Scheduler for Dynamic Parallelism on GPUs. 2016
7 Energy-Effectiveness of Pre-Execution and Energy-Aware P-Thread Selection. 2005
7 Moving the needle, computer architecture research in academe and industry. 2010
7 FlexBulk: intelligently forming atomic blocks in blocked-execution multiprocessors to minimize squashes. 2011
7 Non-race concurrency bug detection through order-sensitive critical sections. 2013
7 FASE: finding amplitude-modulated side-channel emanations. 2015
7 The load slice core microarchitecture. 2015
6 The End of Scaling? Revolutions in Technology and Microarchitecture as We Pass the 90 Nanometer Node. 2006
6 Fetch-Criticality Reduction through Control Independence. 2008
6 Accelerating asynchronous programs through event sneak peek. 2015
6 Reducing world switches in virtualized environment with flexible cross-world calls. 2015
6 Semantic locality and context-based prefetching using reinforcement learning. 2015
6 Efficient execution of memory access phases using dataflow specialization. 2015
6 Clean: a race detector with cleaner semantics. 2015
6 A variable warp size architecture. 2015
6 Coherence protocol for transparent management of scratchpad memories in shared memory manycore architectures. 2015
6 Dynamo: Facebook’s Data Center-Wide Power Management System. 2016
6 Warped-Slicer: Efficient Intra-SM Slicing through Dynamic Resource Partitioning for GPU Multiprogramming. 2016
6 Agile Paging: Exceeding the Best of Nested and Shadow Paging. 2016
5 Improving the future by examining the past. 2010
5 Euripus: A flexible unified hardware memory checkpointing accelerator for bidirectional-debugging and reliability. 2012
5 Configurable fine-grain protection for multicore processor virtualization. 2012
5 Quantum rotations: a case study in static and dynamic machine-code generation for quantum computers. 2013
5 Unified address translation for memory-mapped SSDs with FlashMap. 2015
5 Efficient Synonym Filtering and Scalable Delayed Translation for Hybrid Virtual Caching. 2016
5 Automatic Generation of Efficient Accelerators for Reconfigurable Hardware. 2016
5 Energy Efficient Architecture for Graph Analytics Accelerators. 2016
5 MITTS: Memory Inter-arrival Time Traffic Shaping. 2016
5 Biscuit: A Framework for Near-Data Processing of Big Data Workloads. 2016
5 CASH: Supporting IaaS Customers with a Sub-core Configurable Architecture. 2016
4 IVEC: off-chip memory integrity protection for both security and reliability. 2010
4 MemGuard: A low cost and energy efficient design to support and enhance memory system reliability. 2014
4 Fractal++: Closing the performance gap between fractal and conventional coherence. 2014
4 Branch vanguard: decomposing branch functionality into prediction and resolution instructions. 2015
4 SHRINK: reducing the ISA complexity via instruction recycling. 2015
4 MiSAR: minimalistic synchronization accelerator with resource overflow management. 2015
4 Back to the Future: Leveraging Belady’s Algorithm for Improved Cache Replacement. 2016
4 Mellow Writes: Extending Lifetime in Resistive Memories through Selective Slow Write Backs. 2016
4 Using Multiple Input, Multiple Output Formal Control to Maximize Resource Efficiency in Architectures. 2016
4 ASIC Clouds: Specializing the Datacenter. 2016
4 Virtual Thread: Maximizing Thread-Level Parallelism beyond GPU Scheduling Limit. 2016
3 Releasing efficient beta cores to market early. 2011
3 BlockChop: Dynamic squash elimination for hybrid processor architecture. 2012
3 Pacifier: Record and replay for relaxed-consistency multiprocessors with distributed directory protocol. 2014
3 MBus: an ultra-low power interconnect bus for next generation nanopower systems. 2015
3 Cost-effective speculative scheduling in high performance processors. 2015
3 Treadmill: Attributing the Source of Tail Latency through Precise Load Testing and Statistical Inference. 2016
3 Peak Efficiency Aware Scheduling for Highly Energy Proportional Servers. 2016
2 Shared caches in multicores: the good, the bad, and the ugly. 2010
2 Deconfigurable microprocessor architectures for silicon debug acceleration. 2013
2 VIP: virtualizing IP chains on handheld platforms. 2015
2 Bit-Plane Compression: Transforming Data for Better Compression in Many-Core Architectures. 2016
2 ARM Virtualization: Performance and Architectural Implications. 2016
2 Exploiting Dynamic Timing Slack for Energy Efficiency in Ultra-Low-Power Embedded Systems. 2016
2 PowerChop: Identifying and Managing Non-critical Units in Hybrid Processor Architectures. 2016
2 Future Vector Microprocessor Extensions for Data Aggregations. 2016
2 LAP: Loop-Block Aware Inclusion Properties for Energy-Efficient Asymmetric Last Level Caches. 2016
2 Morpheus: Creating Application Objects Efficiently for Heterogeneous Computing. 2016
2 Towards Statistical Guarantees in Controlling Quality Tradeoffs for Approximate Acceleration. 2016
2 ActivePointers: A Case for Software Address Translation on GPUs. 2016
2 The Anytime Automaton. 2016
1 Computer Architecture Research and Future Microprocessors: Where Do We Go from Here? 2006
1 Efficient digital neurons for large scale cortical architectures. 2014
1 FaultHound: value-locality-based soft-fault tolerance. 2015
1 Accelerating Markov Random Field Inference Using Molecular Optical Gibbs Sampling Units. 2016
1 Strober: Fast and Accurate Sample-Based Energy Simulation for Arbitrary RTL. 2016
1 Decoupling Loads for Nano-Instruction Set Computers. 2016
1 Efficiently Scaling Out-of-Order Cores for Simultaneous Multithreading. 2016
1 Energy Efficient Data Encoding in DRAM Channels Exploiting Data Value Similarity. 2016
1 Boosting Access Parallelism to PCM-Based Main Memory. 2016
1 Power Attack Defense: Securing Battery-Backed Data Centers. 2016
1 Short-Circuit Dispatch: Accelerating Virtual Machine Interpreters on Embedded Processors. 2016
1 APRES: Improving Cache Efficiency by Exploiting Load Characteristics on GPUs. 2016
1 Rescuing Uncorrectable Fault Patterns in On-Chip Memories through Error Pattern Transformation. 2016
1 Asymmetry-Aware Work-Stealing Runtimes. 2016
1 XED: Exposing On-Die Error Detection Information for Strong Memory Reliability. 2016
1 All-Inclusive ECC: Thorough End-to-End Protection for Reliable Computer Memory. 2016
0 Message from the General Chair. 2006
0 Message from the Program Chair. 2006
0 SIGARCH Guidelines. 2006
0 LaZy superscalar. 2015
0 DRAF: A Low-Power DRAM-Based Reconfigurable Acceleration Fabric. 2016
0 Opportunistic Competition Overhead Reduction for Expediting Critical Section in NoC Based CMPs. 2016
0 Production-Run Software Failure Diagnosis via Adaptive Communication Tracking. 2016
0 RelaxFault Memory Repair. 2016
0 Evaluation of an Analog Accelerator for Linear Algebra. 2016
0 Base-Victim Compression: An Opportunistic Cache Compression Architecture. 2016

2016

Cited by Paper title
80 EIE: Efficient Inference Engine on Compressed Deep Neural Network.
32 ISAAC: A Convolutional Neural Network Accelerator with In-Situ Analog Arithmetic in Crossbars.
32 Minerva: Enabling Low-Power, Highly-Accurate Deep Neural Network Accelerators.
31 PRIME: A Novel Processing-in-Memory Architecture for Neural Network Computation in ReRAM-Based Main Memory.
24 Eyeriss: A Spatial Architecture for Energy-Efficient Dataflow for Convolutional Neural Networks.
18 Cnvlutin: Ineffectual-Neuron-Free Deep Neural Network Computing.
17 Transparent Offloading and Mapping (TOM): Enabling Programmer-Transparent Near-Data Processing in GPU Systems.
13 Accelerating Dependent Cache Misses with an Enhanced Memory Controller.
13 Neurocube: A Programmable Digital Neuromorphic Architecture with High-Density 3D Memory.
12 Cambricon: An Instruction Set Architecture for Neural Networks.
11 RedEye: Analog ConvNet Image Sensor Architecture for Continuous Mobile Vision.
8 LaPerm: Locality Aware Scheduler for Dynamic Parallelism on GPUs.
6 Dynamo: Facebook’s Data Center-Wide Power Management System.
6 Warped-Slicer: Efficient Intra-SM Slicing through Dynamic Resource Partitioning for GPU Multiprogramming.
6 Agile Paging: Exceeding the Best of Nested and Shadow Paging.
5 Efficient Synonym Filtering and Scalable Delayed Translation for Hybrid Virtual Caching.
5 Automatic Generation of Efficient Accelerators for Reconfigurable Hardware.
5 Energy Efficient Architecture for Graph Analytics Accelerators.
5 MITTS: Memory Inter-arrival Time Traffic Shaping.
5 Biscuit: A Framework for Near-Data Processing of Big Data Workloads.
5 CASH: Supporting IaaS Customers with a Sub-core Configurable Architecture.
4 Back to the Future: Leveraging Belady’s Algorithm for Improved Cache Replacement.
4 Mellow Writes: Extending Lifetime in Resistive Memories through Selective Slow Write Backs.
4 Using Multiple Input, Multiple Output Formal Control to Maximize Resource Efficiency in Architectures.
4 ASIC Clouds: Specializing the Datacenter.
4 Virtual Thread: Maximizing Thread-Level Parallelism beyond GPU Scheduling Limit.
3 Treadmill: Attributing the Source of Tail Latency through Precise Load Testing and Statistical Inference.
3 Peak Efficiency Aware Scheduling for Highly Energy Proportional Servers.
2 Bit-Plane Compression: Transforming Data for Better Compression in Many-Core Architectures.
2 ARM Virtualization: Performance and Architectural Implications.
2 Exploiting Dynamic Timing Slack for Energy Efficiency in Ultra-Low-Power Embedded Systems.
2 PowerChop: Identifying and Managing Non-critical Units in Hybrid Processor Architectures.
2 Future Vector Microprocessor Extensions for Data Aggregations.
2 LAP: Loop-Block Aware Inclusion Properties for Energy-Efficient Asymmetric Last Level Caches.
2 Morpheus: Creating Application Objects Efficiently for Heterogeneous Computing.
2 Towards Statistical Guarantees in Controlling Quality Tradeoffs for Approximate Acceleration.
2 ActivePointers: A Case for Software Address Translation on GPUs.
2 The Anytime Automaton.
1 Accelerating Markov Random Field Inference Using Molecular Optical Gibbs Sampling Units.
1 Strober: Fast and Accurate Sample-Based Energy Simulation for Arbitrary RTL.
1 Decoupling Loads for Nano-Instruction Set Computers.
1 Efficiently Scaling Out-of-Order Cores for Simultaneous Multithreading.
1 Energy Efficient Data Encoding in DRAM Channels Exploiting Data Value Similarity.
1 Boosting Access Parallelism to PCM-Based Main Memory.
1 Power Attack Defense: Securing Battery-Backed Data Centers.
1 Short-Circuit Dispatch: Accelerating Virtual Machine Interpreters on Embedded Processors.
1 APRES: Improving Cache Efficiency by Exploiting Load Characteristics on GPUs.
1 Rescuing Uncorrectable Fault Patterns in On-Chip Memories through Error Pattern Transformation.
1 Asymmetry-Aware Work-Stealing Runtimes.
1 XED: Exposing On-Die Error Detection Information for Strong Memory Reliability.
1 All-Inclusive ECC: Thorough End-to-End Protection for Reliable Computer Memory.
0 DRAF: A Low-Power DRAM-Based Reconfigurable Acceleration Fabric.
0 Opportunistic Competition Overhead Reduction for Expediting Critical Section in NoC Based CMPs.
0 Production-Run Software Failure Diagnosis via Adaptive Communication Tracking.
0 RelaxFault Memory Repair.
0 Evaluation of an Analog Accelerator for Linear Algebra.
0 Base-Victim Compression: An Opportunistic Cache Compression Architecture.

2015

Cited by Paper title
98 A scalable processing-in-memory accelerator for parallel graph processing.
76 ShiDianNao: shifting vision processing closer to the sensor.
57 Heracles: improving resource efficiency at scale.
57 PIM-enabled instructions: a low-overhead, locality-aware processing-in-memory architecture.
53 Profiling a warehouse-scale computer.
32 Data reorganization in memory using 3D-stacked DRAM.
31 Rumba: an online quality management system for approximate computing.
29 Architecting to achieve a billion requests per second throughput on a single key-value store server platform.
27 Redundant memory mappings for fast access to large memories.
27 DjiNN and Tonic: DNN as a service and its implications for future warehouse scale computers.
26 BlueDBM: an appliance for big data analytics.
24 Quantitative comparison of hardware transactional memory for Blue Gene/Q, zEnterprise EC12, Intel Core, and POWER8.
24 Warped-compression: enabling power efficient GPUs through register compression.
22 A case for core-assisted bottleneck acceleration in GPUs: enabling flexible data compression with assist warps.
21 Dynamic thread block launch: a lightweight execution mechanism to support irregular applications on GPUs.
21 CAWA: coordinated warp scheduling and cache prioritization for critical warp acceleration of GPGPU workloads.
21 A fully associative, tagless DRAM cache.
20 Flexible auto-refresh: enabling scalable and energy-efficient DRAM refresh reductions.
20 BEAR: techniques for mitigating bandwidth bloat in gigascale DRAM caches.
19 Exploring the potential of heterogeneous von Neumann/dataflow execution models.
18 Stash: have your scratchpad and cache it too.
16 Towards sustainable in-situ server systems in the big data era.
16 HEB: deploying and managing hybrid energy buffers for improving datacenter efficiency and economy.
15 Multiple clone row DRAM: a low latency and area optimized DRAM.
14 Page overlays: an enhanced virtual memory framework to enable fine-grained memory management.
14 Hi-fi playback: tolerating position errors in shift operations of racetrack memory.
13 Manycore network interfaces for in-memory rack-scale computing.
13 Callback: efficient synchronization without invalidation with a directory just for spin-waiting.
13 ArMOR: defending against memory consistency model mismatches in heterogeneous architectures.
12 Fusion: design tradeoffs in coherent cache hierarchies for accelerators.
12 SLIP: reducing wire energy in the memory hierarchy.
11 Harmonia: balancing compute and memory power in high-performance GPUs.
10 Flexible software profiling of GPU architectures.
9 Thermal time shifting: leveraging phase change materials to reduce cooling costs in warehouse-scale computers.
9 Probable cause: the deanonymizing effects of approximate DRAM.
9 COP: to compress and protect main memory.
8 DynaSpAM: dynamic spatial architecture mapping using out of order instruction schedules.
8 PrORAM: dynamic prefetcher for oblivious RAM.
8 Computer performance microscopy with Shim.
8 CloudMonatt: an architecture for security health monitoring and attestation of virtual machines in cloud computing.
7 FASE: finding amplitude-modulated side-channel emanations.
7 The load slice core microarchitecture.
6 Accelerating asynchronous programs through event sneak peek.
6 Reducing world switches in virtualized environment with flexible cross-world calls.
6 Semantic locality and context-based prefetching using reinforcement learning.
6 Efficient execution of memory access phases using dataflow specialization.
6 Clean: a race detector with cleaner semantics.
6 A variable warp size architecture.
6 Coherence protocol for transparent management of scratchpad memories in shared memory manycore architectures.
5 Unified address translation for memory-mapped SSDs with FlashMap.
4 Branch vanguard: decomposing branch functionality into prediction and resolution instructions.
4 SHRINK: reducing the ISA complexity via instruction recycling.
4 MiSAR: minimalistic synchronization accelerator with resource overflow management.
3 MBus: an ultra-low power interconnect bus for next generation nanopower systems.
3 Cost-effective speculative scheduling in high performance processors.
2 VIP: virtualizing IP chains on handheld platforms.
1 FaultHound: value-locality-based soft-fault tolerance.
0 LaZy superscalar.

2014

Cited by Paper title
353 A reconfigurable fabric for accelerating large-scale datacenter services.
131 Flipping bits in memory without accessing them: An experimental study of DRAM disturbance errors.
101 Towards energy proportionality for large-scale latency-critical workloads.
95 General-purpose code acceleration with limited-precision analog computation.
88 Memory persistency.
74 Aladdin: A pre-RTL, power-performance accelerator simulator enabling large design space exploration of customized architectures.
58 The CHERI capability model: Revisiting RISC in an age of risk.
50 Enabling preemptive multiprogramming on GPUs.
39 SCORPIO: A 36-core research chip demonstrating snoopy coherence on a scalable mesh NoC with in-network ordering.
37 SC2: A statistical compression cache scheme.
36 WebCore: Architectural support for mobile Web browsing.
36 SynFull: Synthetic traffic models capturing cache coherent behaviour.
35 Harnessing ISA diversity: Design of a heterogeneous-ISA chip multiprocessor.
33 STAG: Spintronic-Tape Architecture for GPGPU cache hierarchies.
33 Half-DRAM: A high-bandwidth and low-power DRAM architecture from the rethinking of fine-grained activation.
29 The Dirty-Block Index.
26 Unifying on-chip and inter-node switching within the Anton 2 network.
25 Architecture implications of pads as a scarce resource.
22 Eliminating redundant fragment shader executions on a mobile GPU via hardware memoization.
21 Going vertical in memory management: Handling multiplicity by multi-policy.
21 SleepScale: Runtime joint speed scaling and sleep states management for power efficient data centers.
19 Single-graph multiple flows: Energy efficient design alternative for GPGPUs.
18 HELIX-RC: An architecture-compiler co-design for automatic parallelization of irregular programs.
17 Optimizing virtual machine consolidation performance on NUMA server architecture for cloud workloads.
17 HIOS: A host interface I/O scheduler for Solid State Disks.
15 CODOMs: Protecting software with Code-centric memory Domains.
15 Real-world design and evaluation of compiler-managed GPU redundant multithreading.
15 EOLE: Paving the way for an effective implementation of value prediction.
13 ArchRanker: A ranking approach to design space exploration.
13 Fine-grain task aggregation and coordination on GPUs.
13 GangES: Gang error simulation for hardware resiliency evaluation.
11 OmniOrder: Directory-based conflict serialization of transactions.
10 Increasing off-chip bandwidth in multi-core processors with switchable pins.
10 Race Logic: A hardware acceleration for dynamic programming algorithms.
9 Replay debugging: Leveraging record and replay for program debugging.
9 Navigating the cache hierarchy with a single lookup.
9 An examination of the architecture and system-level tradeoffs of employing steep slope devices in 3D CMPs.
9 Avoiding core’s DUE & SDC via acoustic wave detectors and tailored error containment and recovery.
8 A low power and reliable charge pump design for Phase Change Memories.
8 Improving the energy efficiency of Big Cores.
8 Row-buffer decoupling: A case for low-latency DRAM microarchitecture.
8 Reducing access latency of MLC PCMs through line striping.
4 MemGuard: A low cost and energy efficient design to support and enhance memory system reliability.
4 Fractal++: Closing the performance gap between fractal and conventional coherence.
3 Pacifier: Record and replay for relaxed-consistency multiprocessors with distributed directory protocol.
1 Efficient digital neurons for large scale cortical architectures.

2013

Cited by Paper title
272 GPUWattch: enabling energy optimizations in GPGPUs.
143 ZSim: fast and accurate microarchitectural simulation of thousand-core systems.
123 Bubble-flux: precise online QoS management for increased utilization in warehouse scale computers.
121 Thin servers with smart pipes: designing SoC accelerators for memcached.
119 An experimental study of data retention behavior in modern DRAM devices: implications for retention time profiling mechanisms.
107 Efficient virtual memory for big memory servers.
105 Convolution engine: balancing efficiency & flexibility in specialized computing.
96 Orchestrated scheduling and prefetching for GPGPUs.
95 Robust architectural support for transactional memory in the Power architecture.
94 Die-stacked DRAM caches for servers: hit ratio, latency, or bandwidth? have it all with footprint cache.
84 Catnap: energy proportional multiple network-on-chip.
83 On the feasibility of online malware detection with performance counters.
66 Navigating big data with high-throughput, energy-efficient data partitioning.
64 Whare-map: heterogeneity in "homogeneous" warehouse-scale computers.
64 Design space exploration and optimization of path oblivious RAM in secure processors.
60 ArchShield: architectural framework for assisting DRAM scaling by tolerating high error rates.
60 A hardware evaluation of cache partitioning to improve utilization and energy-efficiency while preserving responsiveness.
58 Triggered instructions: a control paradigm for spatially-programmed architectures.
54 LINQits: big data on little clients.
52 SurfNoC: a low latency and provably non-interfering approach to secure networks-on-chip.
51 Tri-level-cell phase change memory: toward an efficient and reliable memory system.
50 Reducing memory access latency with asymmetric DRAM bank organizations.
49 Utility-based acceleration of multithreaded applications on asymmetric CMPs.
48 Continuous real-world inputs can open up alternative accelerator designs.
42 Understanding and mitigating refresh overheads in high-density DDR4 DRAM systems.
41 A new perspective for efficient virtual-cache coherence.
41 Flicker: a dynamically adaptive architecture for power limited multicore systems.
41 An energy-efficient and scalable eDRAM-based register file architecture for GPGPU.
39 AC-DIMM: associative computing with STT-MRAM.
38 Criticality stacks: identifying critical threads in parallel programs using synchronization behavior.
35 Improving memory scheduling via processor-side load criticality information.
33 Exploring memory consistency for massively-threaded throughput-oriented processors.
32 The locality-aware adaptive cache coherence protocol.
32 Resilient die-stacked DRAM caches.
31 Zombie memory: extending memory lifetime by reviving dead blocks.
30 Cooperative boosting: needy versus greedy power management.
29 Virtualizing power distribution in datacenters.
27 Agile, efficient virtualization power management with low-latency server power states.
26 DNA-based molecular architecture with spatially localized components.
25 Protozoa: adaptive granularity cache coherence.
25 QuickSAN: a storage area network for fast, distributed, solid state disks.
24 Studying multicore processor scaling via reuse distance analysis.
23 SIMD divergence optimization through intra-warp compaction.
23 Maximizing SIMD resource utilization in GPGPUs with SIMD lane permutation.
22 Bit mapping for balanced PCM cell programming.
22 Dynamic reduction of voltage margins by leveraging on-chip ECC in Itanium II processors.
21 WeeFence: toward making fences free in TSO.
19 QuickRec: prototyping an Intel architecture extension for record and replay of multithreaded programs.
16 CPU transparent protection of OS kernel and hypervisor integrity with programmable DRAM.
14 STREX: boosting instruction cache reuse in OLTP workloads through stratified transaction execution.
14 Secure I/O device sharing among virtual machines on multiple hosts.
11 Microarchitectural mechanisms to exploit value structure in SIMT architectures.
10 Improving virtualization in the presence of software managed translation lookaside buffers.
7 Non-race concurrency bug detection through order-sensitive critical sections.
5 Quantum rotations: a case study in static and dynamic machine-code generation for quantum computers.
2 Deconfigurable microprocessor architectures for silicon debug acceleration.

2012

Cited by Paper title
236 RAIDR: Retention-aware intelligent DRAM refresh.
205 Scheduling heterogeneous multi-cores through performance impact estimation (PIE).
145 Staged memory scheduling: Achieving high performance and scalability in heterogeneous systems.
144 Towards energy-proportional datacenter memory with mobile DRAM.
140 A case for exploiting subarray-level parallelism (SALP) in DRAM.
136 Scale-out processors.
118 Managing distributed UPS energy for effective power capping in data centers.
106 PreSET: Improving performance of phase change memories by exploiting asymmetry in write times.
82 A defect-tolerant accelerator for emerging high-performance applications.
80 iSwitch: Coordinating and optimizing renewable energy powered server clusters.
75 Can traditional programming bridge the Ninja performance gap for parallel computing applications?
74 Simultaneous branch and warp interweaving for sustained GPU performance.
73 TimeWarp: Rethinking timekeeping and performance monitoring mechanisms to mitigate side-channel attacks.
73 The Yin and Yang of power and performance for asymmetric hardware and managed software.
65 End-to-end sequential consistency.
64 A case for random shortcut topologies for HPC interconnects.
61 LOT-ECC: Localized and tiered reliability mechanisms for commodity memory systems.
59 Probabilistic Shared Cache Management (PriSM).
58 Branch regulation: Low-overhead protection from code reuse attacks.
56 Side-channel vulnerability factor: A metric for measuring information leakage.
52 Reducing memory reference energy with opportunistic virtual caching.
49 CAPRI: Prediction of compaction-adequacy for handling control-divergence in GPGPU architectures.
47 The dynamic granularity memory system.
46 Physically Addressed Queueing (PAQ): Improving parallelism in Solid State Disks.
45 iGPU: Exception support and speculative execution on GPUs.
42 RADISH: Always-on sound and complete race detection in software and hardware.
40 BOOM: Enabling mobile memory based low-power server DIMMs.
36 Buffer-on-board memory systems.
35 Watchdog: Hardware for safe and secure manual memory management and full memory safety.
32 Tolerating process variations in nanophotonic on-chip networks.
32 FLEXclusion: Balancing cache capacity and on-chip bandwidth via Flexible Exclusion.
29 Harmony: Collection and analysis of parallel block vectors.
29 PARDIS: A programmable memory controller for the DDRx interfacing standards.
27 Revisiting hardware-assisted page walks for virtualized systems.
27 A first-order mechanistic model for architectural vulnerability factor.
27 A micro-architectural analysis of switched photonic multi-chip interconnects.
25 VRSync: Characterizing and eliminating synchronization-induced voltage emergencies in many-core processors.
24 Boosting mobile GPU performance with a decoupled access/execute fragment processor.
23 Improving writeback efficiency with decoupled last-write prediction.
23 Lane decoupling for improving the timing-error resiliency of wide-SIMD architectures.
19 Inspection resistant memory: Architectural support for security from physical examination.
14 Viper: Virtual pipelines for enhanced reliability.
14 Enhancing effective throughput for transmission line-based bus.
8 Setting an error detection infrastructure with low cost acoustic wave detectors.
5 Euripus: A flexible unified hardware memory checkpointing accelerator for bidirectional-debugging and reliability.
5 Configurable fine-grain protection for multicore processor virtualization.
3 BlockChop: Dynamic squash elimination for hybrid processor architecture.

2011

Cited by Paper title
1203 Dark silicon and the end of multicore scaling.
295 Power management of online data-intensive services.
175 Energy-efficient mechanisms for managing thread context in throughput processors.
172 The impact of memory subsystem resource sharing on datacenter applications.
160 Benefits and limitations of tapping into stored energy for datacenters.
157 Vantage: scalable and efficient fine-grain cache partitioning.
134 Kilo-NOC: a heterogeneous network-on-chip architecture for scalability and service guarantees.
132 Increasing the effectiveness of directory caches by deactivating coherence for private memory blocks.
120 FabScalar: composing synthesizable RTL designs of arbitrary cores within a canonical superscalar template.
114 DBAR: an efficient routing algorithm to support multiple concurrent applications in networks-on-chip.
111 Energy-efficient cache design using variable-strength error-correcting codes.
98 Scalable power control for many-core architectures running multi-threaded applications.
98 Prefetch-aware shared resource management for multi-core systems.
97 Exploring the tradeoffs between programmability and efficiency in data-parallel accelerators.
88 A case for heterogeneous on-chip interconnects for CMPs.
79 An abacus turn model for time/space-efficient reconfigurable routing.
74 Architecting on-chip interconnects for stacked 3D STT-RAM caches in CMPs.
74 Bypass and insertion algorithms for exclusive last-level caches.
69 Crafting a usable microkernel, processor, and I/O system with strict and provable information flow security.
66 Adaptive granularity memory systems: a tradeoff between storage efficiency and throughput.
66 The role of optics in future high radix switch design.
65 Combining memory and a controller with photonics through 3D-stacking to enable scalable and energy-efficient systems.
60 SRAM-DRAM hybrid memory with applications to efficient register files in fine-grained multi-threading.
59 Rapid identification of architectural bottlenecks via precise event counting.
52 SpecTLB: a mechanism for speculative address translation.
51 Sampling + DMR: practical and low-overhead permanent fault detection.
47 i-NVMM: a secure non-volatile main memory system with incremental encryption.
39 CPPC: correctable parity protected cache.
38 Automatic abstraction and fault tolerance in cortical microarchitectures.
36 TLSync: support for multiple fast barriers using on-chip transmission lines.
36 Fighting fire with fire: modeling the datacenter-scale effects of targeted superlattice thermal management.
34 Virtualizing performance asymmetric multi-core systems.
33 Demand-driven software race detection using hardware performance counters.
31 A case for globally shared-medium on-chip interconnect.
25 Rebound: scalable checkpointing for coherent shared memory.
21 Moguls: a model to explore the memory hierarchy for bandwidth improvements.
19 OUTRIDER: efficient memory latency tolerance with decoupled strands.
13 CRIB: consolidated rename, issue, and bypass.
7 FlexBulk: intelligently forming atomic blocks in blocked-execution multiprocessors to minimize squashes.
3 Releasing efficient beta cores to market early.

2010

Cited by Paper title
756 Debunking the 100X GPU vs. CPU myth: an evaluation of throughput computing on CPU and GPU.
413 High performance cache replacement using re-reference interval prediction (RRIP).
389 An integrated GPU power and performance model.
361 Energy proportional datacenter networks.
318 Understanding sources of inefficiency in general-purpose chips.
254 NoHype: virtualized cloud infrastructure without the virtualization.
232 Use ECP, not ECC, for hard failures in resistive memories.
211 Rethinking DRAM design and organization for energy-constrained multi-cores.
195 Relax: an architectural framework for software recovery of hardware faults.
189 Security refresh: prevent malicious wear-out and increase durability for phase-change memory with dynamically randomized address mapping.
186 Dynamic warp subdivision for integrated branch and memory divergence tolerance.
173 Reducing cache power with low-cost, multi-bit error-correcting codes.
172 Web search using mobile cores: quantifying and mitigating the price of efficiency.
143 Morphable memory system: a robust architecture for exploiting multi-level phase change memories.
141 Resistive computation: avoiding the power wall with low-leakage, STT-MRAM based computing.
139 Aérgia: exploiting packet latency slack in on-chip networks.
124 A dynamically configurable coprocessor for convolutional neural networks.
124 Energy-performance tradeoffs in processor architecture and circuit design: a marginal cost analysis.
110 Translation caching: skip, don’t walk (the page table).
100 Re-architecting DRAM memory systems with monolithically integrated silicon photonics.
99 The impact of management operations on the virtualized datacenter.
99 SieveStore: a highly-selective, ensemble-level disk cache for cost-performance.
94 Modeling critical sections in Amdahl’s law and its implications for multicore design.
89 Conflict exceptions: simplifying concurrent language semantics with precise hardware exceptions for data-races.
88 The virtual write queue: coordinating DRAM and last-level cache policies.
86 Silicon-photonic network architectures for scalable, power-efficient multi-chip systems.
84 Evolution of thread-level parallelism in desktop applications.
77 An intra-chip free-space optical interconnect.
73 Thread tailor: dynamically weaving threads together for efficient, adaptive parallel applications.
68 Elastic cooperative caching: an autonomous dynamically adaptive memory hierarchy for chip multiprocessors.
65 A case for FAME: FPGA architecture model execution.
60 ColorSafe: architectural support for debugging and dynamically avoiding multi-variable atomicity violations.
55 Cohesion: a hybrid memory model for accelerators.
52 WiDGET: Wisconsin decoupled grid execution tiles.
42 Using hardware vulnerability factors to enhance AVF analysis.
39 Forwardflow: a scalable core for power-constrained CMPs.
39 RETCON: transactional repair without replay.
38 LReplay: a pending period based deterministic replay scheme.
31 Data marshaling for multi-core architectures.
29 Timetraveler: exploiting acyclic races for optimizing memory race recording.
20 Leveraging the core-level complementary effects of PVT variations to reduce timing emergencies in multi-core processors.
17 Necromancer: enhancing system throughput by animating dead cores.
15 Sentry: light-weight auxiliary memory access control.
13 The rebirth of neural networks.
7 Moving the needle, computer architecture research in academe and industry.
5 Improving the future by examining the past.
4 IVEC: off-chip memory integrity protection for both security and reliability.
2 Shared caches in multicores: the good, the bad, and the ugly.

2009

Cited by Paper title
937 Scalable high performance main memory system using phase-change memory technology.
875 Architecting phase change memory as a scalable DRAM alternative.
644 A durable and energy efficient main memory using phase change memory technology.
532 An analytical model for a GPU architecture with memory-level and thread-level parallelism awareness.
380 Reactive NUCA: near-optimal block placement and replication in distributed caches.
337 Firefly: illuminating future network-on-chip with nanophotonics.
316 A case for bufferless routing in on-chip networks.
306 Hybrid cache architecture with disparate memory technologies.
267 PIPP: promotion/insertion pseudo-partitioning of multi-core shared caches.
238 Hardware support for WCET analysis of hard real-time multicore systems.
236 Scaling the bandwidth wall: challenges in and avenues for CMP scaling.
225 Thread motion: fine-grained power management for multi-core systems.
190 The performance of PC solid-state disks (SSDs) as a function of bandwidth, concurrency, device architecture, and system organization.
174 Temperature-constrained power control for chip multiprocessors with online model estimation.
170 Thread criticality predictors for dynamic performance, power, and resource management in chip multiprocessors.
169 Phastlane: a rapid transit optical routing network.
159 Rigel: an architecture and scalable programming interface for a 1000-core accelerator.
154 Disaggregated memory for expansion and sharing in blade servers.
137 Achieving predictable performance through better memory controller placement in many-core CMPs.
137 A case for an interleaving constrained shared-memory multi-processor.
124 Architectural core salvaging in a multi-core processor for hard-error tolerance.
106 Memory mapped ECC: low-cost error protection for last level caches.
100 SigRace: signature-based data race detection.
96 Spatio-temporal memory streaming.
94 AnySP: anytime anywhere anyway signal processing.
90 InvisiFence: performance-transparent memory ordering in conventional multiprocessors.
74 Application-aware deadlock-free oblivious routing.
70 Decoupled DIMM: building high-bandwidth memory system using low-speed DRAM devices.
66 Indirect adaptive routing on large scale interconnection networks.
62 Internet-scale service infrastructure efficiency.
60 Simultaneous speculative threading: a novel pipeline architecture implemented in Sun’s Rock processor.
46 Stream chaining: exploiting multiple levels of correlation in data prefetching.
44 Multi-execution: multicore caching for data-similar executions.
41 Performance and power of cache-based reconfigurable computing.
40 A fault tolerant, area efficient architecture for Shor’s factoring algorithm.
40 Dynamic performance tuning for speculative threads.
36 ECMon: exposing cache events for monitoring.
29 Boosting single-thread performance in multi-core systems through fine-grain multi-threading.
25 Flexible reference-counting-based hardware acceleration for garbage collection.
25 A memory system design framework: creating smart memories.
21 Dynamic MIPS rate stabilization in out-of-order processors.
19 End-to-end register data-flow continuous self-test.
14 Ten ways to waste a parallel computer.
11 End-to-end performance forecasting: finding bottlenecks before they happen.
10 Decoupled store completion/silent deterministic replay: enabling scalable data memory for CPR/CFP processors.

2008

Cited by Paper title
625 Corona: System Implications of Emerging Nanophotonic Technology.
588 3D-Stacked Memory Architectures for Multi-core Processors.
448 Parallelism-Aware Batch Scheduling: Enhancing both Performance and Fairness of Shared DRAM Systems.
367 Technology-Driven, Highly-Scalable Dragonfly Topology.
292 Variation-Aware Application Scheduling and Power Management for Chip Multiprocessors.
290 Improving NAND Flash Based Disk Caches.
288 Self-Optimizing Memory Controllers: A Reinforcement Learning Approach.
225 Trading off Cache Capacity for Reliability to Enable Low Voltage Operation.
222 Virtual Circuit Tree Multicasting: A Case for On-Chip Hardware Multicast Support.
221 MIRA: A Multi-layered On-Chip Interconnect Router Architecture.
217 DeLorean: Recording and Deterministically Replaying Shared-Memory Multiprocessor Execution Efficiently.
217 A Comprehensive Memory Modeling Tool and Its Application to the Design and Analysis of Future Memory Hierarchies.
216 Understanding and Designing New Server Architectures for Emerging Warehouse-Computing Environments.
179 Rerun: Exploiting Episodes for Lightweight Memory Race Recording.
160 Flexible Decoupled Transactional Memory Support.
150 Globally-Synchronized Frames for Guaranteed Quality-of-Service in On-Chip Networks.
134 TokenTM: Efficient Execution of Large Transactions with Hardware Transactional Memory.
130 Flexible Hardware Acceleration for Instruction-Grain Program Monitoring.
118 Atom-Aid: Detecting and Surviving Atomicity Violations.
113 Using Hardware Memory Protection to Build a High-Performance, Strongly-Atomic Hybrid Transactional Memory.
113 VEAL: Virtualized Execution Accelerator for Loops.
91 A Proactive Wearout Recovery Approach for Exploiting Microarchitectural Redundancy to Extend Cache SRAM Lifetime.
78 ReVIVaL: A Variation-Tolerant Architecture Using Voltage Interpolation and Variable Latency.
75 Online Estimation of Architectural Vulnerability Factor for Soft Errors.
71 iDEAL: Inter-router Dual-Function Energy and Area-Efficient Links for Network-on-Chip (NoC) Architectures.
69 Polymorphic On-Chip Networks.
45 Learning and Leveraging the Relationship between Architecture-Level Measurements and Individual User Satisfaction.
40 Achieving Out-of-Order Performance with Almost In-Order Complexity.
39 From Speculation to Security: Practical and Efficient Information Flow Tracking Using Speculative Hardware.
38 Atomic Vector Operations on Chip Multiprocessors.
37 Software-Controlled Priority Characterization of POWER5 Processor.
35 Running a Quantum Circuit at the Speed of Data.
33 Intra-disk Parallelism: An Idea Whose Time Has Come.
15 Counting Dependence Predictors.
15 Microcoded Architectures for Ion-Trap Quantum Computers.
9 A Two-Level Load/Store Queue Based on Execution Locality.
6 Fetch-Criticality Reduction through Control Independence.

2007

Cited by Paper title
1553 Power provisioning for a warehouse-sized computer.
547 Adaptive insertion policies for high performance caching.
451 Anton, a special-purpose machine for molecular dynamics simulation.
368 Express virtual channels: towards the ideal interconnection fabric.
365 An effective hybrid transactional memory system with strong isolation guarantees.
363 Flattened butterfly: a cost-efficient topology for high-radix networks.
294 Core fusion: accommodating software diversity in chip multiprocessors.
285 Raksha: a flexible information flow architecture for software security.
265 A novel dimensionally-decomposed router for on-chip communication in 3D architectures.
263 New cache designs for thwarting software cache-based side channel attacks.
258 Performance pathologies in hardware transactional memory.
225 Carbon: architectural support for fine-grained parallelism on chip multiprocessors.
223 BulkSC: bulk enforcement of sequential consistency.
207 Virtual hierarchies to support server consolidation.
192 Configurable isolation: building high availability systems with commodity multi-core processors.
192 Virtual private caches.
162 Making the fast case common and the uncommon case simple in unbounded transactional memory.
157 Analysis of redundancy and application balance in the SPEC CPU2006 benchmark suite.
153 ReCycle: pipeline adaptation to tolerate process variation.
138 An integrated hardware-software approach to flexible transactional memory.
137 Limiting the power consumption of main memory.
127 Comparing memory systems for chip multiprocessors.
125 Dynamic prediction of architectural vulnerability from microarchitectural state.
109 Interconnect design considerations for large NUCA caches.
103 Examining ACE analysis reliability estimates using fault-injection.
99 Mechanisms for store-wait-free multiprocessors.
96 MetaTM/TxLinux: transactional memory for an operating system.
90 Synchronization state buffer: supporting efficient fine-grain synchronization on many-core architectures.
88 Rotary router: an efficient architecture for CMP interconnection networks.
82 Hardware atomicity for reliable software speculation.
76 Power model validation through thermal measurements.
72 A 64-bit stream processor architecture for scientific applications.
69 Automated design of application specific superscalar processors: an analytical approach.
64 Mechanisms for bounding vulnerabilities of processor structures.
57 Thermal modeling and management of DRAM memory systems.
48 Physical simulation for animation and visual effects: parallelization and characterization for chip multiprocessors.
47 ParallAX: an architecture for real-time physics.
45 Late-binding: enabling unordered load-store queues.
43 VPC prediction: reducing the cost of indirect branches via hardware-based dynamic devirtualization.
39 Matrix scheduler reloaded.
36 Transparent control independence (TCI).
29 Aquacore: a programmable architecture for microfluidics.
25 Ginger: control independence using tag rewriting.
13 Performance and security lessons learned from virtualizing the Alpha processor.
11 Tailoring quantum architectures to implementation style: a quantum computer for mobile and persistent qubits.
10 Architectural implications of brick and mortar silicon manufacturing.

2006

Cited by Paper title
539 Techniques for Multicore Thermal Management: Classification and New Exploration.
477 Cooperative Caching for Chip Multiprocessors.
427 Design and Management of 3D Chip Multiprocessors Using Network-in-Memory.
376 Ensemble-level Power Management for Dense Blade Servers.
327 Bulk Disambiguation of Speculative Threads in Multiprocessors.
289 A Scalable Architecture For High-Throughput Regular-Expression Pattern Matching.
267 A Case for MLP-Aware Cache Replacement.
253 SODA: A Low-power Architecture For Software Radio.
229 A Gracefully Degrading and Energy-Efficient Modular Router Architecture for On-Chip Networks.
217 Architectural Semantics for Practical Transactional Memory.
208 The BlackWidow High-Radix Clos Network.
165 Spatial Memory Streaming.
139 Improving Cost, Performance, and Security of Memory Encryption and Authentication.
131 Interconnect-Aware Coherence Protocols for Chip Multiprocessors.
120 Learning-Based SMT Processor Resource Distribution via Hill-Climbing.
120 TRAP-Array: A Disk Array Architecture Providing Timely Recovery to Any Point-in-time.
70 Chisel: A Storage-efficient, Collision-free Hash-based Network Processing Architecture.
68 Program Demultiplexing: Data-flow based Speculative Parallelization of Methods in Sequential Programs.
62 Balanced Cache: Reducing Conflict Misses of Direct-Mapped Caches.
60 An Integrated Framework for Dependable and Revivable Architectures Using Multicore Processors.
59 Flexible Snooping: Adaptive Forwarding and Filtering of Snoops in Embedded-Ring Multiprocessors.
54 Memory Model = Instruction Reordering + Store Atomicity.
51 Quantum Memory Hierarchies: Efficient Designs to Match Available Parallelism in Quantum Computing.
44 Multiple Instruction Stream Processor.
42 Area-Performance Trade-offs in Tiled Dataflow Architectures.
39 Reducing Startup Time in Co-Designed Virtual Machines.
33 Tolerating Dependences Between Large Speculative Threads Via Sub-Threads.
30 Conditional Memory Ordering.
30 Interconnection Networks for Scalable Quantum Computers.
24 Slackened Memory Dependence Enforcement: Combining Opportunistic Forwarding with Decoupled Verification.
21 Distributed Arithmetic on a Quantum Multicomputer.
17 The Future of Virtualization Technology.
6 The End of Scaling? Revolutions in Technology and Microarchitecture as We Pass the 90 Nanometer Node.
1 Computer Architecture Research and Future Microprocessors: Where Do We Go from Here?
0 Message from the General Chair.
0 Message from the Program Chair.
0 SIGARCH Guidelines.

2005

Cited by Paper title
610 Continuous Optimization.
511 Interconnections in Multi-Core Architectures: Understanding Mechanisms, Overheads and Scaling.
497 Virtualizing Transactional Memory.
411 Victim Replication: Maximizing Capacity while Hiding Wire Delay in Tiled Chip Multiprocessors.
341 BugNet: Continuously Recording Program Execution for Deterministic Replay Debugging.
341 Optimizing Replication, Communication, and Capacity Allocation in CMPs.
334 A High Throughput String Matching Architecture for Intrusion Detection and Prevention.
305 The Impact of Performance Asymmetry in Emerging Multicore Architectures.
263 Mitigating Amdahl’s Law through EPI Throttling.
232 Microarchitecture of a High-Radix Router.
230 Design and Implementation of the AEGIS Single-Chip Secure Processor Using Physical Random Functions.
227 Exploiting Structural Duplication for Lifetime Reliability Enhancement.
220 The V-Way Cache: Demand Based Associativity via Global Replacement.
215 Computing Architectural Vulnerability Factors for Address-Based Structures.
211 Architecture for Protecting Critical Secrets in Microprocessors.
195 Near-Optimal Worst-Case Throughput Routing for Two-Dimensional Mesh Networks.
195 An Ultra Low Power System Architecture for Sensor Network Applications.
191 RegionScout: Exploiting Coarse Grain Sharing in Snoop-Based Coherence.
178 An Architecture Framework for Transparent Instruction Set Customization in Embedded Processors.
150 Design and Evaluation of Hybrid Fault-Detection Systems.
148 Opportunistic Transient-Fault Detection.
148 Direct Cache Access for High Bandwidth Network I/O.
125 A Robust Main-Memory Compression Scheme.
124 Improving Multiprocessor Performance with Coarse-Grain Coherence Tracking.
122 Temporal Streaming of Shared Memory.
118 Analysis of the O-GEometric History Length Branch Predictor.
109 Energy Optimization of Subthreshold-Voltage Sensor Network Processors.
101 Adaptive Mechanisms and Policies for Managing Cache Hierarchies in Chip Multiprocessors.
99 High Efficiency Counter Mode Security Architecture via Prediction and Precomputation.
98 A Tree Based Router Search Engine Architecture with Single Port Memories.
95 Piecewise Linear Branch Prediction.
82 Disk Drive Roadmap from the Thermal Perspective: A Case for Dynamic Thermal Management.
80 Store Vulnerability Window (SVW): Re-Execution Filtering for Enhanced Load Optimization.
76 Rescue: A Microarchitecture for Testability and Defect Tolerance.
70 Scalable Load and Store Processing in Latency Tolerant Processors.
69 Techniques for Efficient Processing in Runahead Execution Engines.
64 An Integrated Memory Array Processor Architecture for Embedded Image Recognition Systems.
59 Deconstructing Commodity Storage Clusters.
53 An Evaluation Framework and Instruction Set Architecture for Ion-Trap Based Quantum Micro-Architectures.
46 RENO - A Rename-Based Instruction Optimizer.
44 Store Buffer Design in First-Level Multibanked Data Caches.
42 Improving Program Efficiency by Packing Instructions into Registers.
35 Dynamic Verification of Sequential Consistency.
9 Increased Scalability and Power Efficiency by Using Multiple Speed Pipelines.
7 Energy-Effectiveness of Pre-Execution and Energy-Aware P-Thread Selection.