ISCA

All

Cited by	Paper title	Year
1553	Power provisioning for a warehouse-sized computer.	2007
1203	Dark silicon and the end of multicore scaling.	2011
937	Scalable high performance main memory system using phase-change memory technology.	2009
875	Architecting phase change memory as a scalable dram alternative.	2009
756	Debunking the 100X GPU vs. CPU myth: an evaluation of throughput computing on CPU and GPU.	2010
644	A durable and energy efficient main memory using phase change memory technology.	2009
625	Corona: System Implications of Emerging Nanophotonic Technology.	2008
610	Continuous Optimization.	2005
588	3D-Stacked Memory Architectures for Multi-core Processors.	2008
547	Adaptive insertion policies for high performance caching.	2007
539	Techniques for Multicore Thermal Management: Classification and New Exploration.	2006
532	An analytical model for a GPU architecture with memory-level and thread-level parallelism awareness.	2009
511	Interconnections in Multi-Core Architectures: Understanding Mechanisms, Overheads and Scaling.	2005
497	Virtualizing Transactional Memory.	2005
477	Cooperative Caching for Chip Multiprocessors.	2006
451	Anton, a special-purpose machine for molecular dynamics simulation.	2007
448	Parallelism-Aware Batch Scheduling: Enhancing both Performance and Fairness of Shared DRAM Systems.	2008
427	Design and Management of 3D Chip Multiprocessors Using Network-in-Memory.	2006
413	High performance cache replacement using re-reference interval prediction (RRIP).	2010
411	Victim Replication: Maximizing Capacity while Hiding Wire Delay in Tiled Chip Multiprocessors.	2005
389	An integrated GPU power and performance model.	2010
380	Reactive NUCA: near-optimal block placement and replication in distributed caches.	2009
376	Ensemble-level Power Management for Dense Blade Servers.	2006
368	Express virtual channels: towards the ideal interconnection fabric.	2007
367	Technology-Driven, Highly-Scalable Dragonfly Topology.	2008
365	An effective hybrid transactional memory system with strong isolation guarantees.	2007
363	Flattened butterfly: a cost-efficient topology for high-radix networks.	2007
361	Energy proportional datacenter networks.	2010
353	A reconfigurable fabric for accelerating large-scale datacenter services.	2014
341	BugNet: Continuously Recording Program Execution for Deterministic Replay Debugging.	2005
341	Optimizing Replication, Communication, and Capacity Allocation in CMPs.	2005
337	Firefly: illuminating future network-on-chip with nanophotonics.	2009
334	A High Throughput String Matching Architecture for Intrusion Detection and Prevention.	2005
327	Bulk Disambiguation of Speculative Threads in Multiprocessors.	2006
318	Understanding sources of inefficiency in general-purpose chips.	2010
316	A case for bufferless routing in on-chip networks.	2009
306	Hybrid cache architecture with disparate memory technologies.	2009
305	The Impact of Performance Asymmetry in Emerging Multicore Architectures.	2005
295	Power management of online data-intensive services.	2011
294	Core fusion: accommodating software diversity in chip multiprocessors.	2007
292	Variation-Aware Application Scheduling and Power Management for Chip Multiprocessors.	2008
290	Improving NAND Flash Based Disk Caches.	2008
289	A Scalable Architecture For High-Throughput Regular-Expression Pattern Matching.	2006
288	Self-Optimizing Memory Controllers: A Reinforcement Learning Approach.	2008
285	Raksha: a flexible information flow architecture for software security.	2007
272	GPUWattch: enabling energy optimizations in GPGPUs.	2013
267	A Case for MLP-Aware Cache Replacement.	2006
267	PIPP: promotion/insertion pseudo-partitioning of multi-core shared caches.	2009
265	A novel dimensionally-decomposed router for on-chip communication in 3D architectures.	2007
263	Mitigating Amdahl’s Law through EPI Throttling.	2005
263	New cache designs for thwarting software cache-based side channel attacks.	2007
258	Performance pathologies in hardware transactional memory.	2007
254	NoHype: virtualized cloud infrastructure without the virtualization.	2010
253	SODA: A Low-power Architecture For Software Radio.	2006
238	Hardware support for WCET analysis of hard real-time multicore systems.	2009
236	Scaling the bandwidth wall: challenges in and avenues for CMP scaling.	2009
236	RAIDR: Retention-aware intelligent DRAM refresh.	2012
232	Microarchitecture of a High-Radix Router.	2005
232	Use ECP, not ECC, for hard failures in resistive memories.	2010
230	Design and Implementation of the AEGIS Single-Chip Secure Processor Using Physical Random Functions.	2005
229	A Gracefully Degrading and Energy-Efficient Modular Router Architecture for On-Chip Networks.	2006
227	Exploiting Structural Duplication for Lifetime Reliability Enhancement.	2005
225	Carbon: architectural support for fine-grained parallelism on chip multiprocessors.	2007
225	Trading off Cache Capacity for Reliability to Enable Low Voltage Operation.	2008
225	Thread motion: fine-grained power management for multi-core systems.	2009
223	BulkSC: bulk enforcement of sequential consistency.	2007
222	Virtual Circuit Tree Multicasting: A Case for On-Chip Hardware Multicast Support.	2008
221	MIRA: A Multi-layered On-Chip Interconnect Router Architecture.	2008
220	The V-Way Cache: Demand Based Associativity via Global Replacement.	2005
217	Architectural Semantics for Practical Transactional Memory.	2006
217	DeLorean: Recording and Deterministically Replaying Shared-Memory Multiprocessor Execution Efficiently.	2008
217	A Comprehensive Memory Modeling Tool and Its Application to the Design and Analysis of Future Memory Hierarchies.	2008
216	Understanding and Designing New Server Architectures for Emerging Warehouse-Computing Environments.	2008
215	Computing Architectural Vulnerability Factors for Address-Based Structures.	2005
211	Architecture for Protecting Critical Secrets in Microprocessors.	2005
211	Rethinking DRAM design and organization for energy-constrained multi-cores.	2010
208	The BlackWidow High-Radix Clos Network.	2006
207	Virtual hierarchies to support server consolidation.	2007
205	Scheduling heterogeneous multi-cores through performance impact estimation (PIE).	2012
195	Near-Optimal Worst-Case Throughput Routing for Two-Dimensional Mesh Networks.	2005
195	An Ultra Low Power System Architecture for Sensor Network Applications.	2005
195	Relax: an architectural framework for software recovery of hardware faults.	2010
192	Configurable isolation: building high availability systems with commodity multi-core processors.	2007
192	Virtual private caches.	2007
191	RegionScout: Exploiting Coarse Grain Sharing in Snoop-Based Coherence.	2005
190	The performance of PC solid-state disks (SSDs) as a function of bandwidth, concurrency, device architecture, and system organization.	2009
189	Security refresh: prevent malicious wear-out and increase durability for phase-change memory with dynamically randomized address mapping.	2010
186	Dynamic warp subdivision for integrated branch and memory divergence tolerance.	2010
179	Rerun: Exploiting Episodes for Lightweight Memory Race Recording.	2008
178	An Architecture Framework for Transparent Instruction Set Customization in Embedded Processors.	2005
175	Energy-efficient mechanisms for managing thread context in throughput processors.	2011
174	Temperature-constrained power control for chip multiprocessors with online model estimation.	2009
173	Reducing cache power with low-cost, multi-bit error-correcting codes.	2010
172	Web search using mobile cores: quantifying and mitigating the price of efficiency.	2010
172	The impact of memory subsystem resource sharing on datacenter applications.	2011
170	Thread criticality predictors for dynamic performance, power, and resource management in chip multiprocessors.	2009
169	Phastlane: a rapid transit optical routing network.	2009
165	Spatial Memory Streaming.	2006
162	Making the fast case common and the uncommon case simple in unbounded transactional memory.	2007
160	Flexible Decoupled Transactional Memory Support.	2008
160	Benefits and limitations of tapping into stored energy for datacenters.	2011
159	Rigel: an architecture and scalable programming interface for a 1000-core accelerator.	2009
157	Analysis of redundancy and application balance in the SPEC CPU2006 benchmark suite.	2007
157	Vantage: scalable and efficient fine-grain cache partitioning.	2011
154	Disaggregated memory for expansion and sharing in blade servers.	2009
153	ReCycle: pipeline adaptation to tolerate process variation.	2007
150	Design and Evaluation of Hybrid Fault-Detection Systems.	2005
150	Globally-Synchronized Frames for Guaranteed Quality-of-Service in On-Chip Networks.	2008
148	Opportunistic Transient-Fault Detection.	2005
148	Direct Cache Access for High Bandwidth Network I/O.	2005
145	Staged memory scheduling: Achieving high performance and scalability in heterogeneous systems.	2012
144	Towards energy-proportional datacenter memory with mobile DRAM.	2012
143	Morphable memory system: a robust architecture for exploiting multi-level phase change memories.	2010
143	ZSim: fast and accurate microarchitectural simulation of thousand-core systems.	2013
141	Resistive computation: avoiding the power wall with low-leakage, STT-MRAM based computing.	2010
140	A case for exploiting subarray-level parallelism (SALP) in DRAM.	2012
139	Improving Cost, Performance, and Security of Memory Encryption and Authentication.	2006
139	Aérgia: exploiting packet latency slack in on-chip networks.	2010
138	An integrated hardware-software approach to flexible transactional memory.	2007
137	Limiting the power consumption of main memory.	2007
137	Achieving predictable performance through better memory controller placement in many-core CMPs.	2009
137	A case for an interleaving constrained shared-memory multi-processor.	2009
136	Scale-out processors.	2012
134	TokenTM: Efficient Execution of Large Transactions with Hardware Transactional Memory.	2008
134	Kilo-NOC: a heterogeneous network-on-chip architecture for scalability and service guarantees.	2011
132	Increasing the effectiveness of directory caches by deactivating coherence for private memory blocks.	2011
131	Interconnect-Aware Coherence Protocols for Chip Multiprocessors.	2006
131	Flipping bits in memory without accessing them: An experimental study of DRAM disturbance errors.	2014
130	Flexible Hardware Acceleration for Instruction-Grain Program Monitoring.	2008
127	Comparing memory systems for chip multiprocessors.	2007
125	A Robust Main-Memory Compression Scheme.	2005
125	Dynamic prediction of architectural vulnerability from microarchitectural state.	2007
124	Improving Multiprocessor Performance with Coarse-Grain Coherence Tracking.	2005
124	Architectural core salvaging in a multi-core processor for hard-error tolerance.	2009
124	A dynamically configurable coprocessor for convolutional neural networks.	2010
124	Energy-performance tradeoffs in processor architecture and circuit design: a marginal cost analysis.	2010
123	Bubble-flux: precise online QoS management for increased utilization in warehouse scale computers.	2013
122	Temporal Streaming of Shared Memory.	2005
121	Thin servers with smart pipes: designing SoC accelerators for memcached.	2013
120	Learning-Based SMT Processor Resource Distribution via Hill-Climbing.	2006
120	TRAP-Array: A Disk Array Architecture Providing Timely Recovery to Any Point-in-time.	2006
120	FabScalar: composing synthesizable RTL designs of arbitrary cores within a canonical superscalar template.	2011
119	An experimental study of data retention behavior in modern DRAM devices: implications for retention time profiling mechanisms.	2013
118	Analysis of the O-GEometric History Length Branch Predictor.	2005
118	Atom-Aid: Detecting and Surviving Atomicity Violations.	2008
118	Managing distributed UPS energy for effective power capping in data centers.	2012
114	DBAR: an efficient routing algorithm to support multiple concurrent applications in networks-on-chip.	2011
113	Using Hardware Memory Protection to Build a High-Performance, Strongly-Atomic Hybrid Transactional Memory.	2008
113	VEAL: Virtualized Execution Accelerator for Loops.	2008
111	Energy-efficient cache design using variable-strength error-correcting codes.	2011
110	Translation caching: skip, don’t walk (the page table).	2010
109	Energy Optimization of Subthreshold-Voltage Sensor Network Processors.	2005
109	Interconnect design considerations for large NUCA caches.	2007
107	Efficient virtual memory for big memory servers.	2013
106	Memory mapped ECC: low-cost error protection for last level caches.	2009
106	PreSET: Improving performance of phase change memories by exploiting asymmetry in write times.	2012
105	Convolution engine: balancing efficiency & flexibility in specialized computing.	2013
103	Examining ACE analysis reliability estimates using fault-injection.	2007
101	Adaptive Mechanisms and Policies for Managing Cache Hierarchies in Chip Multiprocessors.	2005
101	Towards energy proportionality for large-scale latency-critical workloads.	2014
100	SigRace: signature-based data race detection.	2009
100	Re-architecting DRAM memory systems with monolithically integrated silicon photonics.	2010
99	High Efficiency Counter Mode Security Architecture via Prediction and Precomputation.	2005
99	Mechanisms for store-wait-free multiprocessors.	2007
99	The impact of management operations on the virtualized datacenter.	2010
99	SieveStore: a highly-selective, ensemble-level disk cache for cost-performance.	2010
98	A Tree Based Router Search Engine Architecture with Single Port Memories.	2005
98	Scalable power control for many-core architectures running multi-threaded applications.	2011
98	Prefetch-aware shared resource management for multi-core systems.	2011
98	A scalable processing-in-memory accelerator for parallel graph processing.	2015
97	Exploring the tradeoffs between programmability and efficiency in data-parallel accelerators.	2011
96	MetaTM/TxLinux: transactional memory for an operating system.	2007
96	Spatio-temporal memory streaming.	2009
96	Orchestrated scheduling and prefetching for GPGPUs.	2013
95	Piecewise Linear Branch Prediction.	2005
95	Robust architectural support for transactional memory in the power architecture.	2013
95	General-purpose code acceleration with limited-precision analog computation.	2014
94	AnySP: anytime anywhere anyway signal processing.	2009
94	Modeling critical sections in Amdahl’s law and its implications for multicore design.	2010
94	Die-stacked DRAM caches for servers: hit ratio, latency, or bandwidth? have it all with footprint cache.	2013
91	A Proactive Wearout Recovery Approach for Exploiting Microarchitectural Redundancy to Extend Cache SRAM Lifetime.	2008
90	Synchronization state buffer: supporting efficient fine-grain synchronization on many-core architectures.	2007
90	InvisiFence: performance-transparent memory ordering in conventional multiprocessors.	2009
89	Conflict exceptions: simplifying concurrent language semantics with precise hardware exceptions for data-races.	2010
88	Rotary router: an efficient architecture for CMP interconnection networks.	2007
88	The virtual write queue: coordinating DRAM and last-level cache policies.	2010
88	A case for heterogeneous on-chip interconnects for CMPs.	2011
88	Memory persistency.	2014
86	Silicon-photonic network architectures for scalable, power-efficient multi-chip systems.	2010
84	Evolution of thread-level parallelism in desktop applications.	2010
84	Catnap: energy proportional multiple network-on-chip.	2013
83	On the feasibility of online malware detection with performance counters.	2013
82	Disk Drive Roadmap from the Thermal Perspective: A Case for Dynamic Thermal Management.	2005
82	Hardware atomicity for reliable software speculation.	2007
82	A defect-tolerant accelerator for emerging high-performance applications.	2012
80	Store Vulnerability Window (SVW): Re-Execution Filtering for Enhanced Load Optimization.	2005
80	iSwitch: Coordinating and optimizing renewable energy powered server clusters.	2012
80	EIE: Efficient Inference Engine on Compressed Deep Neural Network.	2016
79	An abacus turn model for time/space-efficient reconfigurable routing.	2011
78	ReVIVaL: A Variation-Tolerant Architecture Using Voltage Interpolation and Variable Latency.	2008
77	An intra-chip free-space optical interconnect.	2010
76	Rescue: A Microarchitecture for Testability and Defect Tolerance.	2005
76	Power model validation through thermal measurements.	2007
76	ShiDianNao: shifting vision processing closer to the sensor.	2015
75	Online Estimation of Architectural Vulnerability Factor for Soft Errors.	2008
75	Can traditional programming bridge the Ninja performance gap for parallel computing applications?	2012
74	Application-aware deadlock-free oblivious routing.	2009
74	Architecting on-chip interconnects for stacked 3D STT-RAM caches in CMPs.	2011
74	Bypass and insertion algorithms for exclusive last-level caches.	2011
74	Simultaneous branch and warp interweaving for sustained GPU performance.	2012
74	Aladdin: A pre-RTL, power-performance accelerator simulator enabling large design space exploration of customized architectures.	2014
73	Thread tailor: dynamically weaving threads together for efficient, adaptive parallel applications.	2010
73	TimeWarp: Rethinking timekeeping and performance monitoring mechanisms to mitigate side-channel attacks.	2012
73	The Yin and Yang of power and performance for asymmetric hardware and managed software.	2012
72	A 64-bit stream processor architecture for scientific applications.	2007
71	iDEAL: Inter-router Dual-Function Energy and Area-Efficient Links for Network-on-Chip (NoC) Architectures.	2008
70	Scalable Load and Store Processing in Latency Tolerant Processors.	2005
70	Chisel: A Storage-efficient, Collision-free Hash-based Network Processing Architecture.	2006
70	Decoupled DIMM: building high-bandwidth memory system using low-speed DRAM devices.	2009
69	Techniques for Efficient Processing in Runahead Execution Engines.	2005
69	Automated design of application specific superscalar processors: an analytical approach.	2007
69	Polymorphic On-Chip Networks.	2008
69	Crafting a usable microkernel, processor, and I/O system with strict and provable information flow security.	2011
68	Program Demultiplexing: Data-flow based Speculative Parallelization of Methods in Sequential Programs.	2006
68	Elastic cooperative caching: an autonomous dynamically adaptive memory hierarchy for chip multiprocessors.	2010
66	Indirect adaptive routing on large scale interconnection networks.	2009
66	Adaptive granularity memory systems: a tradeoff between storage efficiency and throughput.	2011
66	The role of optics in future high radix switch design.	2011
66	Navigating big data with high-throughput, energy-efficient data partitioning.	2013
65	A case for FAME: FPGA architecture model execution.	2010
65	Combining memory and a controller with photonics through 3D-stacking to enable scalable and energy-efficient systems.	2011
65	End-to-end sequential consistency.	2012
64	An Integrated Memory Array Processor Architecture for Embedded Image Recognition Systems.	2005
64	Mechanisms for bounding vulnerabilities of processor structures.	2007
64	A case for random shortcut topologies for HPC interconnects.	2012
64	Whare-map: heterogeneity in "homogeneous" warehouse-scale computers.	2013
64	Design space exploration and optimization of path oblivious RAM in secure processors.	2013
62	Balanced Cache: Reducing Conflict Misses of Direct-Mapped Caches.	2006
62	Internet-scale service infrastructure efficiency.	2009
61	LOT-ECC: Localized and tiered reliability mechanisms for commodity memory systems.	2012
60	An Integrated Framework for Dependable and Revivable Architectures Using Multicore Processors.	2006
60	Simultaneous speculative threading: a novel pipeline architecture implemented in sun’s rock processor.	2009
60	ColorSafe: architectural support for debugging and dynamically avoiding multi-variable atomicity violations.	2010
60	SRAM-DRAM hybrid memory with applications to efficient register files in fine-grained multi-threading.	2011
60	ArchShield: architectural framework for assisting DRAM scaling by tolerating high error rates.	2013
60	A hardware evaluation of cache partitioning to improve utilization and energy-efficiency while preserving responsiveness.	2013
59	Deconstructing Commodity Storage Clusters.	2005
59	Flexible Snooping: Adaptive Forwarding and Filtering of Snoops in Embedded-Ring Multiprocessors.	2006
59	Rapid identification of architectural bottlenecks via precise event counting.	2011
59	Probabilistic Shared Cache Management (PriSM).	2012
58	Branch regulation: Low-overhead protection from code reuse attacks.	2012
58	Triggered instructions: a control paradigm for spatially-programmed architectures.	2013
58	The CHERI capability model: Revisiting RISC in an age of risk.	2014
57	Thermal modeling and management of DRAM memory systems.	2007
57	Heracles: improving resource efficiency at scale.	2015
57	PIM-enabled instructions: a low-overhead, locality-aware processing-in-memory architecture.	2015
56	Side-channel vulnerability factor: A metric for measuring information leakage.	2012
55	Cohesion: a hybrid memory model for accelerators.	2010
54	Memory Model = Instruction Reordering + Store Atomicity.	2006
54	LINQits: big data on little clients.	2013
53	An Evaluation Framework and Instruction Set Architecture for Ion-Trap Based Quantum Micro-Architectures.	2005
53	Profiling a warehouse-scale computer.	2015
52	WiDGET: Wisconsin decoupled grid execution tiles.	2010
52	SpecTLB: a mechanism for speculative address translation.	2011
52	Reducing memory reference energy with opportunistic virtual caching.	2012
52	SurfNoC: a low latency and provably non-interfering approach to secure networks-on-chip.	2013
51	Quantum Memory Hierarchies: Efficient Designs to Match Available Parallelism in Quantum Computing.	2006
51	Sampling + DMR: practical and low-overhead permanent fault detection.	2011
51	Tri-level-cell phase change memory: toward an efficient and reliable memory system.	2013
50	Reducing memory access latency with asymmetric DRAM bank organizations.	2013
50	Enabling preemptive multiprogramming on GPUs.	2014
49	CAPRI: Prediction of compaction-adequacy for handling control-divergence in GPGPU architectures.	2012
49	Utility-based acceleration of multithreaded applications on asymmetric CMPs.	2013
48	Physical simulation for animation and visual effects: parallelization and characterization for chip multiprocessors.	2007
48	Continuous real-world inputs can open up alternative accelerator designs.	2013
47	ParallAX: an architecture for real-time physics.	2007
47	i-NVMM: a secure non-volatile main memory system with incremental encryption.	2011
47	The dynamic granularity memory system.	2012
46	RENO - A Rename-Based Instruction Optimizer.	2005
46	Stream chaining: exploiting multiple levels of correlation in data prefetching.	2009
46	Physically Addressed Queueing (PAQ): Improving parallelism in Solid State Disks.	2012
45	Late-binding: enabling unordered load-store queues.	2007
45	Learning and Leveraging the Relationship between Architecture-Level Measurements and Individual User Satisfaction.	2008
45	iGPU: Exception support and speculative execution on GPUs.	2012
44	Store Buffer Design in First-Level Multibanked Data Caches.	2005
44	Multiple Instruction Stream Processor.	2006
44	Multi-execution: multicore caching for data-similar executions.	2009
43	VPC prediction: reducing the cost of indirect branches via hardware-based dynamic devirtualization.	2007
42	Improving Program Efficiency by Packing Instructions into Registers.	2005
42	Area-Performance Trade-offs in Tiled Dataflow Architectures.	2006
42	Using hardware vulnerability factors to enhance AVF analysis.	2010
42	RADISH: Always-on sound and complete race detection in software and hardware.	2012
42	Understanding and mitigating refresh overheads in high-density DDR4 DRAM systems.	2013
41	Performance and power of cache-based reconfigurable computing.	2009
41	A new perspective for efficient virtual-cache coherence.	2013
41	Flicker: a dynamically adaptive architecture for power limited multicore systems.	2013
41	An energy-efficient and scalable eDRAM-based register file architecture for GPGPU.	2013
40	Achieving Out-of-Order Performance with Almost In-Order Complexity.	2008
40	A fault tolerant, area efficient architecture for Shor’s factoring algorithm.	2009
40	Dynamic performance tuning for speculative threads.	2009
40	BOOM: Enabling mobile memory based low-power server DIMMs.	2012
39	Reducing Startup Time in Co-Designed Virtual Machines.	2006
39	Matrix scheduler reloaded.	2007
39	From Speculation to Security: Practical and Efficient Information Flow Tracking Using Speculative Hardware.	2008
39	Forwardflow: a scalable core for power-constrained CMPs.	2010
39	RETCON: transactional repair without replay.	2010
39	CPPC: correctable parity protected cache.	2011
39	AC-DIMM: associative computing with STT-MRAM.	2013
39	SCORPIO: A 36-core research chip demonstrating snoopy coherence on a scalable mesh NoC with in-network ordering.	2014
38	Atomic Vector Operations on Chip Multiprocessors.	2008
38	LReplay: a pending period based deterministic replay scheme.	2010
38	Automatic abstraction and fault tolerance in cortical microarchitectures.	2011
38	Criticality stacks: identifying critical threads in parallel programs using synchronization behavior.	2013
37	Software-Controlled Priority Characterization of POWER5 Processor.	2008
37	SC2: A statistical compression cache scheme.	2014
36	Transparent control independence (TCI).	2007
36	ECMon: exposing cache events for monitoring.	2009
36	TLSync: support for multiple fast barriers using on-chip transmission lines.	2011
36	Fighting fire with fire: modeling the datacenter-scale effects of targeted superlattice thermal management.	2011
36	Buffer-on-board memory systems.	2012
36	WebCore: Architectural support for mobile Web browsing.	2014
36	SynFull: Synthetic traffic models capturing cache coherent behaviour.	2014
35	Dynamic Verification of Sequential Consistency.	2005
35	Running a Quantum Circuit at the Speed of Data.	2008
35	Watchdog: Hardware for safe and secure manual memory management and full memory safety.	2012
35	Improving memory scheduling via processor-side load criticality information.	2013
35	Harnessing ISA diversity: Design of a heterogeneous-ISA chip multiprocessor.	2014
34	Virtualizing performance asymmetric multi-core systems.	2011
33	Tolerating Dependences Between Large Speculative Threads Via Sub-Threads.	2006
33	Intra-disk Parallelism: An Idea Whose Time Has Come.	2008
33	Demand-driven software race detection using hardware performance counters.	2011
33	Exploring memory consistency for massively-threaded throughput-oriented processors.	2013
33	STAG: Spintronic-Tape Architecture for GPGPU cache hierarchies.	2014
33	Half-DRAM: A high-bandwidth and low-power DRAM architecture from the rethinking of fine-grained activation.	2014
32	Tolerating process variations in nanophotonic on-chip networks.	2012
32	FLEXclusion: Balancing cache capacity and on-chip bandwidth via Flexible Exclusion.	2012
32	The locality-aware adaptive cache coherence protocol.	2013
32	Resilient die-stacked DRAM caches.	2013
32	Data reorganization in memory using 3D-stacked DRAM.	2015
32	ISAAC: A Convolutional Neural Network Accelerator with In-Situ Analog Arithmetic in Crossbars.	2016
32	Minerva: Enabling Low-Power, Highly-Accurate Deep Neural Network Accelerators.	2016
31	Data marshaling for multi-core architectures.	2010
31	A case for globally shared-medium on-chip interconnect.	2011
31	Zombie memory: extending memory lifetime by reviving dead blocks.	2013
31	Rumba: an online quality management system for approximate computing.	2015
31	PRIME: A Novel Processing-in-Memory Architecture for Neural Network Computation in ReRAM-Based Main Memory.	2016
30	Conditional Memory Ordering.	2006
30	Interconnection Networks for Scalable Quantum Computers.	2006
30	Cooperative boosting: needy versus greedy power management.	2013
29	Aquacore: a programmable architecture for microfluidics.	2007
29	Boosting single-thread performance in multi-core systems through fine-grain multi-threading.	2009
29	Timetraveler: exploiting acyclic races for optimizing memory race recording.	2010
29	Harmony: Collection and analysis of parallel block vectors.	2012
29	PARDIS: A programmable memory controller for the DDRx interfacing standards.	2012
29	Virtualizing power distribution in datacenters.	2013
29	The Dirty-Block Index.	2014
29	Architecting to achieve a billion requests per second throughput on a single key-value store server platform.	2015
27	Revisiting hardware-assisted page walks for virtualized systems.	2012
27	A first-order mechanistic model for architectural vulnerability factor.	2012
27	A micro-architectural analysis of switched photonic multi-chip interconnects.	2012
27	Agile, efficient virtualization power management with low-latency server power states.	2013
27	Redundant memory mappings for fast access to large memories.	2015
27	DjiNN and Tonic: DNN as a service and its implications for future warehouse scale computers.	2015
26	DNA-based molecular architecture with spatially localized components.	2013
26	Unifying on-chip and inter-node switching within the Anton 2 network.	2014
26	BlueDBM: an appliance for big data analytics.	2015
25	Ginger: control independence using tag rewriting.	2007
25	Flexible reference-counting-based hardware acceleration for garbage collection.	2009
25	A memory system design framework: creating smart memories.	2009
25	Rebound: scalable checkpointing for coherent shared memory.	2011
25	VRSync: Characterizing and eliminating synchronization-induced voltage emergencies in many-core processors.	2012
25	Protozoa: adaptive granularity cache coherence.	2013
25	QuickSAN: a storage area network for fast, distributed, solid state disks.	2013
25	Architecture implications of pads as a scarce resource.	2014
24	Slackened Memory Dependence Enforcement: Combining Opportunistic Forwarding with Decoupled Verification.	2006
24	Boosting mobile GPU performance with a decoupled access/execute fragment processor.	2012
24	Studying multicore processor scaling via reuse distance analysis.	2013
24	Quantitative comparison of hardware transactional memory for Blue Gene/Q, zEnterprise EC12, Intel Core, and POWER8.	2015
24	Warped-compression: enabling power efficient GPUs through register compression.	2015
24	Eyeriss: A Spatial Architecture for Energy-Efficient Dataflow for Convolutional Neural Networks.	2016
23	Improving writeback efficiency with decoupled last-write prediction.	2012
23	Lane decoupling for improving the timing-error resiliency of wide-SIMD architectures.	2012
23	SIMD divergence optimization through intra-warp compaction.	2013
23	Maximizing SIMD resource utilization in GPGPUs with SIMD lane permutation.	2013
22	Bit mapping for balanced PCM cell programming.	2013
22	Dynamic reduction of voltage margins by leveraging on-chip ECC in Itanium II processors.	2013
22	Eliminating redundant fragment shader executions on a mobile GPU via hardware memoization.	2014
22	A case for core-assisted bottleneck acceleration in GPUs: enabling flexible data compression with assist warps.	2015
21	Distributed Arithmetic on a Quantum Multicomputer.	2006
21	Dynamic MIPS rate stabilization in out-of-order processors.	2009
21	Moguls: a model to explore the memory hierarchy for bandwidth improvements.	2011
21	WeeFence: toward making fences free in TSO.	2013
21	Going vertical in memory management: Handling multiplicity by multi-policy.	2014
21	SleepScale: Runtime joint speed scaling and sleep states management for power efficient data centers.	2014
21	Dynamic thread block launch: a lightweight execution mechanism to support irregular applications on GPUs.	2015
21	CAWA: coordinated warp scheduling and cache prioritization for critical warp acceleration of GPGPU workloads.	2015
21	A fully associative, tagless DRAM cache.	2015
20	Leveraging the core-level complementary effects of PVT variations to reduce timing emergencies in multi-core processors.	2010
20	Flexible auto-refresh: enabling scalable and energy-efficient DRAM refresh reductions.	2015
20	BEAR: techniques for mitigating bandwidth bloat in gigascale DRAM caches.	2015
19	End-to-end register data-flow continuous self-test.	2009
19	OUTRIDER: efficient memory latency tolerance with decoupled strands.	2011
19	Inspection resistant memory: Architectural support for security from physical examination.	2012
19	QuickRec: prototyping an intel architecture extension for record and replay of multithreaded programs.	2013
19	Single-graph multiple flows: Energy efficient design alternative for GPGPUs.	2014
19	Exploring the potential of heterogeneous von neumann/dataflow execution models.	2015
18	HELIX-RC: An architecture-compiler co-design for automatic parallelization of irregular programs.	2014
18	Stash: have your scratchpad and cache it too.	2015
18	Cnvlutin: Ineffectual-Neuron-Free Deep Neural Network Computing.	2016
17	The Future of Virtualization Technology.	2006
17	Necromancer: enhancing system throughput by animating dead cores.	2010
17	Optimizing virtual machine consolidation performance on NUMA server architecture for cloud workloads.	2014
17	HIOS: A host interface I/O scheduler for Solid State Disks.	2014
17	Transparent Offloading and Mapping (TOM): Enabling Programmer-Transparent Near-Data Processing in GPU Systems.	2016
16	CPU transparent protection of OS kernel and hypervisor integrity with programmable DRAM.	2013
16	Towards sustainable in-situ server systems in the big data era.	2015
16	HEB: deploying and managing hybrid energy buffers for improving datacenter efficiency and economy.	2015
15	Counting Dependence Predictors.	2008
15	Microcoded Architectures for Ion-Tap Quantum Computers.	2008
15	Sentry: light-weight auxiliary memory access control.	2010
15	CODOMs: Protecting software with Code-centric memory Domains.	2014
15	Real-world design and evaluation of compiler-managed GPU redundant multithreading.	2014
15	EOLE: Paving the way for an effective implementation of value prediction.	2014
15	Multiple clone row DRAM: a low latency and area optimized DRAM.	2015
14	Ten ways to waste a parallel computer.	2009
14	Viper: Virtual pipelines for enhanced reliability.	2012
14	Enhancing effective throughput for transmission line-based bus.	2012
14	STREX: boosting instruction cache reuse in OLTP workloads through stratified transaction execution.	2013
14	Secure I/O device sharing among virtual machines on multiple hosts.	2013
14	Page overlays: an enhanced virtual memory framework to enable fine-grained memory management.	2015
14	Hi-fi playback: tolerating position errors in shift operations of racetrack memory.	2015
13	Performance and security lessons learned from virtualizing the alpha processor.	2007
13	The rebirth of neural networks.	2010
13	CRIB: consolidated rename, issue, and bypass.	2011
13	ArchRanker: A ranking approach to design space exploration.	2014
13	Fine-grain task aggregation and coordination on GPUs.	2014
13	GangES: Gang error simulation for hardware resiliency evaluation.	2014
13	Manycore network interfaces for in-memory rack-scale computing.	2015
13	Callback: efficient synchronization without invalidation with a directory just for spin-waiting.	2015
13	ArMOR: defending against memory consistency model mismatches in heterogeneous architectures.	2015
13	Accelerating Dependent Cache Misses with an Enhanced Memory Controller.	2016
13	Neurocube: A Programmable Digital Neuromorphic Architecture with High-Density 3D Memory.	2016
12	Fusion: design tradeoffs in coherent cache hierarchies for accelerators.	2015
12	SLIP: reducing wire energy in the memory hierarchy.	2015
12	Cambricon: An Instruction Set Architecture for Neural Networks.	2016
11	Tailoring quantum architectures to implementation style: a quantum computer for mobile and persistent qubits.	2007
11	End-to-end performance forecasting: finding bottlenecks before they happen.	2009
11	Microarchitectural mechanisms to exploit value structure in SIMT architectures.	2013
11	OmniOrder: Directory-based conflict serialization of transactions.	2014
11	Harmonia: balancing compute and memory power in high-performance GPUs.	2015
11	RedEye: Analog ConvNet Image Sensor Architecture for Continuous Mobile Vision.	2016
10	Architectural implications of brick and mortar silicon manufacturing.	2007
10	Decoupled store completion/silent deterministic replay: enabling scalable data memory for CPR/CFP processors.	2009
10	Improving virtualization in the presence of software managed translation lookaside buffers.	2013
10	Increasing off-chip bandwidth in multi-core processors with switchable pins.	2014
10	Race Logic: A hardware acceleration for dynamic programming algorithms.	2014
10	Flexible software profiling of GPU architectures.	2015
9	Increased Scalability and Power Efficiency by Using Multiple Speed Pipelines.	2005
9	A Two-Level Load/Store Queue Based on Execution Locality.	2008
9	Replay debugging: Leveraging record and replay for program debugging.	2014
9	Navigating the cache hierarchy with a single lookup.	2014
9	An examination of the architecture and system-level tradeoffs of employing steep slope devices in 3D CMPs.	2014
9	Avoiding core’s DUE & SDC via acoustic wave detectors and tailored error containment and recovery.	2014
9	Thermal time shifting: leveraging phase change materials to reduce cooling costs in warehouse-scale computers.	2015
9	Probable cause: the deanonymizing effects of approximate DRAM.	2015
9	COP: to compress and protect main memory.	2015
8	Setting an error detection infrastructure with low cost acoustic wave detectors.	2012
8	A low power and reliable charge pump design for Phase Change Memories.	2014
8	Improving the energy efficiency of Big Cores.	2014
8	Row-buffer decoupling: A case for low-latency DRAM microarchitecture.	2014
8	Reducing access latency of MLC PCMs through line striping.	2014
8	DynaSpAM: dynamic spatial architecture mapping using out of order instruction schedules.	2015
8	PrORAM: dynamic prefetcher for oblivious RAM.	2015
8	Computer performance microscopy with Shim.	2015
8	CloudMonatt: an architecture for security health monitoring and attestation of virtual machines in cloud computing.	2015
8	LaPerm: Locality Aware Scheduler for Dynamic Parallelism on GPUs.	2016
7	Energy-Effectiveness of Pre-Execution and Energy-Aware P-Thread Selection.	2005
7	Moving the needle, computer architecture research in academe and industry.	2010
7	FlexBulk: intelligently forming atomic blocks in blocked-execution multiprocessors to minimize squashes.	2011
7	Non-race concurrency bug detection through order-sensitive critical sections.	2013
7	FASE: finding amplitude-modulated side-channel emanations.	2015
7	The load slice core microarchitecture.	2015
6	The End of Scaling? Revolutions in Technology and Microarchitecture as We Pass the 90 Nanometer Node.	2006
6	Fetch-Criticality Reduction through Control Independence.	2008
6	Accelerating asynchronous programs through event sneak peek.	2015
6	Reducing world switches in virtualized environment with flexible cross-world calls.	2015
6	Semantic locality and context-based prefetching using reinforcement learning.	2015
6	Efficient execution of memory access phases using dataflow specialization.	2015
6	Clean: a race detector with cleaner semantics.	2015
6	A variable warp size architecture.	2015
6	Coherence protocol for transparent management of scratchpad memories in shared memory manycore architectures.	2015
6	Dynamo: Facebook’s Data Center-Wide Power Management System.	2016
6	Warped-Slicer: Efficient Intra-SM Slicing through Dynamic Resource Partitioning for GPU Multiprogramming.	2016
6	Agile Paging: Exceeding the Best of Nested and Shadow Paging.	2016
5	Improving the future by examining the past.	2010
5	Euripus: A flexible unified hardware memory checkpointing accelerator for bidirectional-debugging and reliability.	2012
5	Configurable fine-grain protection for multicore processor virtualization.	2012
5	Quantum rotations: a case study in static and dynamic machine-code generation for quantum computers.	2013
5	Unified address translation for memory-mapped SSDs with FlashMap.	2015
5	Efficient Synonym Filtering and Scalable Delayed Translation for Hybrid Virtual Caching.	2016
5	Automatic Generation of Efficient Accelerators for Reconfigurable Hardware.	2016
5	Energy Efficient Architecture for Graph Analytics Accelerators.	2016
5	MITTS: Memory Inter-arrival Time Traffic Shaping.	2016
5	Biscuit: A Framework for Near-Data Processing of Big Data Workloads.	2016
5	CASH: Supporting IaaS Customers with a Sub-core Configurable Architecture.	2016
4	IVEC: off-chip memory integrity protection for both security and reliability.	2010
4	MemGuard: A low cost and energy efficient design to support and enhance memory system reliability.	2014
4	Fractal++: Closing the performance gap between fractal and conventional coherence.	2014
4	Branch vanguard: decomposing branch functionality into prediction and resolution instructions.	2015
4	SHRINK: reducing the ISA complexity via instruction recycling.	2015
4	MiSAR: minimalistic synchronization accelerator with resource overflow management.	2015
4	Back to the Future: Leveraging Belady’s Algorithm for Improved Cache Replacement.	2016
4	Mellow Writes: Extending Lifetime in Resistive Memories through Selective Slow Write Backs.	2016
4	Using Multiple Input, Multiple Output Formal Control to Maximize Resource Efficiency in Architectures.	2016
4	ASIC Clouds: Specializing the Datacenter.	2016
4	Virtual Thread: Maximizing Thread-Level Parallelism beyond GPU Scheduling Limit.	2016
3	Releasing efficient beta cores to market early.	2011
3	BlockChop: Dynamic squash elimination for hybrid processor architecture.	2012
3	Pacifier: Record and replay for relaxed-consistency multiprocessors with distributed directory protocol.	2014
3	MBus: an ultra-low power interconnect bus for next generation nanopower systems.	2015
3	Cost-effective speculative scheduling in high performance processors.	2015
3	Treadmill: Attributing the Source of Tail Latency through Precise Load Testing and Statistical Inference.	2016
3	Peak Efficiency Aware Scheduling for Highly Energy Proportional Servers.	2016
2	Shared caches in multicores: the good, the bad, and the ugly.	2010
2	Deconfigurable microprocessor architectures for silicon debug acceleration.	2013
2	VIP: virtualizing IP chains on handheld platforms.	2015
2	Bit-Plane Compression: Transforming Data for Better Compression in Many-Core Architectures.	2016
2	ARM Virtualization: Performance and Architectural Implications.	2016
2	Exploiting Dynamic Timing Slack for Energy Efficiency in Ultra-Low-Power Embedded Systems.	2016
2	PowerChop: Identifying and Managing Non-critical Units in Hybrid Processor Architectures.	2016
2	Future Vector Microprocessor Extensions for Data Aggregations.	2016
2	LAP: Loop-Block Aware Inclusion Properties for Energy-Efficient Asymmetric Last Level Caches.	2016
2	Morpheus: Creating Application Objects Efficiently for Heterogeneous Computing.	2016
2	Towards Statistical Guarantees in Controlling Quality Tradeoffs for Approximate Acceleration.	2016
2	ActivePointers: A Case for Software Address Translation on GPUs.	2016
2	The Anytime Automaton.	2016
1	Computer Architecture Research and Future Microprocessors: Where Do We Go from Here?	2006
1	Efficient digital neurons for large scale cortical architectures.	2014
1	FaultHound: value-locality-based soft-fault tolerance.	2015
1	Accelerating Markov Random Field Inference Using Molecular Optical Gibbs Sampling Units.	2016
1	Strober: Fast and Accurate Sample-Based Energy Simulation for Arbitrary RTL.	2016
1	Decoupling Loads for Nano-Instruction Set Computers.	2016
1	Efficiently Scaling Out-of-Order Cores for Simultaneous Multithreading.	2016
1	Energy Efficient Data Encoding in DRAM Channels Exploiting Data Value Similarity.	2016
1	Boosting Access Parallelism to PCM-Based Main Memory.	2016
1	Power Attack Defense: Securing Battery-Backed Data Centers.	2016
1	Short-Circuit Dispatch: Accelerating Virtual Machine Interpreters on Embedded Processors.	2016
1	APRES: Improving Cache Efficiency by Exploiting Load Characteristics on GPUs.	2016
1	Rescuing Uncorrectable Fault Patterns in On-Chip Memories through Error Pattern Transformation.	2016
1	Asymmetry-Aware Work-Stealing Runtimes.	2016
1	XED: Exposing On-Die Error Detection Information for Strong Memory Reliability.	2016
1	All-Inclusive ECC: Thorough End-to-End Protection for Reliable Computer Memory.	2016
0	Message from the General Chair.	2006
0	Message from the Program Chair.	2006
0	SIGARCH Guidelines.	2006
0	LaZy superscalar.	2015
0	DRAF: A Low-Power DRAM-Based Reconfigurable Acceleration Fabric.	2016
0	Opportunistic Competition Overhead Reduction for Expediting Critical Section in NoC Based CMPs.	2016
0	Production-Run Software Failure Diagnosis via Adaptive Communication Tracking.	2016
0	RelaxFault Memory Repair.	2016
0	Evaluation of an Analog Accelerator for Linear Algebra.	2016
0	Base-Victim Compression: An Opportunistic Cache Compression Architecture.	2016

2016¶

Cited by	Paper title
80	EIE: Efficient Inference Engine on Compressed Deep Neural Network.
32	ISAAC: A Convolutional Neural Network Accelerator with In-Situ Analog Arithmetic in Crossbars.
32	Minerva: Enabling Low-Power, Highly-Accurate Deep Neural Network Accelerators.
31	PRIME: A Novel Processing-in-Memory Architecture for Neural Network Computation in ReRAM-Based Main Memory.
24	Eyeriss: A Spatial Architecture for Energy-Efficient Dataflow for Convolutional Neural Networks.
18	Cnvlutin: Ineffectual-Neuron-Free Deep Neural Network Computing.
17	Transparent Offloading and Mapping (TOM): Enabling Programmer-Transparent Near-Data Processing in GPU Systems.
13	Accelerating Dependent Cache Misses with an Enhanced Memory Controller.
13	Neurocube: A Programmable Digital Neuromorphic Architecture with High-Density 3D Memory.
12	Cambricon: An Instruction Set Architecture for Neural Networks.
11	RedEye: Analog ConvNet Image Sensor Architecture for Continuous Mobile Vision.
8	LaPerm: Locality Aware Scheduler for Dynamic Parallelism on GPUs.
6	Dynamo: Facebook’s Data Center-Wide Power Management System.
6	Warped-Slicer: Efficient Intra-SM Slicing through Dynamic Resource Partitioning for GPU Multiprogramming.
6	Agile Paging: Exceeding the Best of Nested and Shadow Paging.
5	Efficient Synonym Filtering and Scalable Delayed Translation for Hybrid Virtual Caching.
5	Automatic Generation of Efficient Accelerators for Reconfigurable Hardware.
5	Energy Efficient Architecture for Graph Analytics Accelerators.
5	MITTS: Memory Inter-arrival Time Traffic Shaping.
5	Biscuit: A Framework for Near-Data Processing of Big Data Workloads.
5	CASH: Supporting IaaS Customers with a Sub-core Configurable Architecture.
4	Back to the Future: Leveraging Belady’s Algorithm for Improved Cache Replacement.
4	Mellow Writes: Extending Lifetime in Resistive Memories through Selective Slow Write Backs.
4	Using Multiple Input, Multiple Output Formal Control to Maximize Resource Efficiency in Architectures.
4	ASIC Clouds: Specializing the Datacenter.
4	Virtual Thread: Maximizing Thread-Level Parallelism beyond GPU Scheduling Limit.
3	Treadmill: Attributing the Source of Tail Latency through Precise Load Testing and Statistical Inference.
3	Peak Efficiency Aware Scheduling for Highly Energy Proportional Servers.
2	Bit-Plane Compression: Transforming Data for Better Compression in Many-Core Architectures.
2	ARM Virtualization: Performance and Architectural Implications.
2	Exploiting Dynamic Timing Slack for Energy Efficiency in Ultra-Low-Power Embedded Systems.
2	PowerChop: Identifying and Managing Non-critical Units in Hybrid Processor Architectures.
2	Future Vector Microprocessor Extensions for Data Aggregations.
2	LAP: Loop-Block Aware Inclusion Properties for Energy-Efficient Asymmetric Last Level Caches.
2	Morpheus: Creating Application Objects Efficiently for Heterogeneous Computing.
2	Towards Statistical Guarantees in Controlling Quality Tradeoffs for Approximate Acceleration.
2	ActivePointers: A Case for Software Address Translation on GPUs.
2	The Anytime Automaton.
1	Accelerating Markov Random Field Inference Using Molecular Optical Gibbs Sampling Units.
1	Strober: Fast and Accurate Sample-Based Energy Simulation for Arbitrary RTL.
1	Decoupling Loads for Nano-Instruction Set Computers.
1	Efficiently Scaling Out-of-Order Cores for Simultaneous Multithreading.
1	Energy Efficient Data Encoding in DRAM Channels Exploiting Data Value Similarity.
1	Boosting Access Parallelism to PCM-Based Main Memory.
1	Power Attack Defense: Securing Battery-Backed Data Centers.
1	Short-Circuit Dispatch: Accelerating Virtual Machine Interpreters on Embedded Processors.
1	APRES: Improving Cache Efficiency by Exploiting Load Characteristics on GPUs.
1	Rescuing Uncorrectable Fault Patterns in On-Chip Memories through Error Pattern Transformation.
1	Asymmetry-Aware Work-Stealing Runtimes.
1	XED: Exposing On-Die Error Detection Information for Strong Memory Reliability.
1	All-Inclusive ECC: Thorough End-to-End Protection for Reliable Computer Memory.
0	DRAF: A Low-Power DRAM-Based Reconfigurable Acceleration Fabric.
0	Opportunistic Competition Overhead Reduction for Expediting Critical Section in NoC Based CMPs.
0	Production-Run Software Failure Diagnosis via Adaptive Communication Tracking.
0	RelaxFault Memory Repair.
0	Evaluation of an Analog Accelerator for Linear Algebra.
0	Base-Victim Compression: An Opportunistic Cache Compression Architecture.

2015¶

Cited by	Paper title
98	A scalable processing-in-memory accelerator for parallel graph processing.
76	ShiDianNao: shifting vision processing closer to the sensor.
57	Heracles: improving resource efficiency at scale.
57	PIM-enabled instructions: a low-overhead, locality-aware processing-in-memory architecture.
53	Profiling a warehouse-scale computer.
32	Data reorganization in memory using 3D-stacked DRAM.
31	Rumba: an online quality management system for approximate computing.
29	Architecting to achieve a billion requests per second throughput on a single key-value store server platform.
27	Redundant memory mappings for fast access to large memories.
27	DjiNN and Tonic: DNN as a service and its implications for future warehouse scale computers.
26	BlueDBM: an appliance for big data analytics.
24	Quantitative comparison of hardware transactional memory for Blue Gene/Q, zEnterprise EC12, Intel Core, and POWER8.
24	Warped-compression: enabling power efficient GPUs through register compression.
22	A case for core-assisted bottleneck acceleration in GPUs: enabling flexible data compression with assist warps.
21	Dynamic thread block launch: a lightweight execution mechanism to support irregular applications on GPUs.
21	CAWA: coordinated warp scheduling and cache prioritization for critical warp acceleration of GPGPU workloads.
21	A fully associative, tagless DRAM cache.
20	Flexible auto-refresh: enabling scalable and energy-efficient DRAM refresh reductions.
20	BEAR: techniques for mitigating bandwidth bloat in gigascale DRAM caches.
19	Exploring the potential of heterogeneous von neumann/dataflow execution models.
18	Stash: have your scratchpad and cache it too.
16	Towards sustainable in-situ server systems in the big data era.
16	HEB: deploying and managing hybrid energy buffers for improving datacenter efficiency and economy.
15	Multiple clone row DRAM: a low latency and area optimized DRAM.
14	Page overlays: an enhanced virtual memory framework to enable fine-grained memory management.
14	Hi-fi playback: tolerating position errors in shift operations of racetrack memory.
13	Manycore network interfaces for in-memory rack-scale computing.
13	Callback: efficient synchronization without invalidation with a directory just for spin-waiting.
13	ArMOR: defending against memory consistency model mismatches in heterogeneous architectures.
12	Fusion: design tradeoffs in coherent cache hierarchies for accelerators.
12	SLIP: reducing wire energy in the memory hierarchy.
11	Harmonia: balancing compute and memory power in high-performance GPUs.
10	Flexible software profiling of GPU architectures.
9	Thermal time shifting: leveraging phase change materials to reduce cooling costs in warehouse-scale computers.
9	Probable cause: the deanonymizing effects of approximate DRAM.
9	COP: to compress and protect main memory.
8	DynaSpAM: dynamic spatial architecture mapping using out of order instruction schedules.
8	PrORAM: dynamic prefetcher for oblivious RAM.
8	Computer performance microscopy with Shim.
8	CloudMonatt: an architecture for security health monitoring and attestation of virtual machines in cloud computing.
7	FASE: finding amplitude-modulated side-channel emanations.
7	The load slice core microarchitecture.
6	Accelerating asynchronous programs through event sneak peek.
6	Reducing world switches in virtualized environment with flexible cross-world calls.
6	Semantic locality and context-based prefetching using reinforcement learning.
6	Efficient execution of memory access phases using dataflow specialization.
6	Clean: a race detector with cleaner semantics.
6	A variable warp size architecture.
6	Coherence protocol for transparent management of scratchpad memories in shared memory manycore architectures.
5	Unified address translation for memory-mapped SSDs with FlashMap.
4	Branch vanguard: decomposing branch functionality into prediction and resolution instructions.
4	SHRINK: reducing the ISA complexity via instruction recycling.
4	MiSAR: minimalistic synchronization accelerator with resource overflow management.
3	MBus: an ultra-low power interconnect bus for next generation nanopower systems.
3	Cost-effective speculative scheduling in high performance processors.
2	VIP: virtualizing IP chains on handheld platforms.
1	FaultHound: value-locality-based soft-fault tolerance.
0	LaZy superscalar.

2014¶

Cited by	Paper title
353	A reconfigurable fabric for accelerating large-scale datacenter services.
131	Flipping bits in memory without accessing them: An experimental study of DRAM disturbance errors.
101	Towards energy proportionality for large-scale latency-critical workloads.
95	General-purpose code acceleration with limited-precision analog computation.
88	Memory persistency.
74	Aladdin: A pre-RTL, power-performance accelerator simulator enabling large design space exploration of customized architectures.
58	The CHERI capability model: Revisiting RISC in an age of risk.
50	Enabling preemptive multiprogramming on GPUs.
39	SCORPIO: A 36-core research chip demonstrating snoopy coherence on a scalable mesh NoC with in-network ordering.
37	SC2: A statistical compression cache scheme.
36	WebCore: Architectural support for mobile Web browsing.
36	SynFull: Synthetic traffic models capturing cache coherent behaviour.
35	Harnessing ISA diversity: Design of a heterogeneous-ISA chip multiprocessor.
33	STAG: Spintronic-Tape Architecture for GPGPU cache hierarchies.
33	Half-DRAM: A high-bandwidth and low-power DRAM architecture from the rethinking of fine-grained activation.
29	The Dirty-Block Index.
26	Unifying on-chip and inter-node switching within the Anton 2 network.
25	Architecture implications of pads as a scarce resource.
22	Eliminating redundant fragment shader executions on a mobile GPU via hardware memoization.
21	Going vertical in memory management: Handling multiplicity by multi-policy.
21	SleepScale: Runtime joint speed scaling and sleep states management for power efficient data centers.
19	Single-graph multiple flows: Energy efficient design alternative for GPGPUs.
18	HELIX-RC: An architecture-compiler co-design for automatic parallelization of irregular programs.
17	Optimizing virtual machine consolidation performance on NUMA server architecture for cloud workloads.
17	HIOS: A host interface I/O scheduler for Solid State Disks.
15	CODOMs: Protecting software with Code-centric memory Domains.
15	Real-world design and evaluation of compiler-managed GPU redundant multithreading.
15	EOLE: Paving the way for an effective implementation of value prediction.
13	ArchRanker: A ranking approach to design space exploration.
13	Fine-grain task aggregation and coordination on GPUs.
13	GangES: Gang error simulation for hardware resiliency evaluation.
11	OmniOrder: Directory-based conflict serialization of transactions.
10	Increasing off-chip bandwidth in multi-core processors with switchable pins.
10	Race Logic: A hardware acceleration for dynamic programming algorithms.
9	Replay debugging: Leveraging record and replay for program debugging.
9	Navigating the cache hierarchy with a single lookup.
9	An examination of the architecture and system-level tradeoffs of employing steep slope devices in 3D CMPs.
9	Avoiding core’s DUE & SDC via acoustic wave detectors and tailored error containment and recovery.
8	A low power and reliable charge pump design for Phase Change Memories.
8	Improving the energy efficiency of Big Cores.
8	Row-buffer decoupling: A case for low-latency DRAM microarchitecture.
8	Reducing access latency of MLC PCMs through line striping.
4	MemGuard: A low cost and energy efficient design to support and enhance memory system reliability.
4	Fractal++: Closing the performance gap between fractal and conventional coherence.
3	Pacifier: Record and replay for relaxed-consistency multiprocessors with distributed directory protocol.
1	Efficient digital neurons for large scale cortical architectures.

2013¶

Cited by	Paper title
272	GPUWattch: enabling energy optimizations in GPGPUs.
143	ZSim: fast and accurate microarchitectural simulation of thousand-core systems.
123	Bubble-flux: precise online QoS management for increased utilization in warehouse scale computers.
121	Thin servers with smart pipes: designing SoC accelerators for memcached.
119	An experimental study of data retention behavior in modern DRAM devices: implications for retention time profiling mechanisms.
107	Efficient virtual memory for big memory servers.
105	Convolution engine: balancing efficiency & flexibility in specialized computing.
96	Orchestrated scheduling and prefetching for GPGPUs.
95	Robust architectural support for transactional memory in the power architecture.
94	Die-stacked DRAM caches for servers: hit ratio, latency, or bandwidth? have it all with footprint cache.
84	Catnap: energy proportional multiple network-on-chip.
83	On the feasibility of online malware detection with performance counters.
66	Navigating big data with high-throughput, energy-efficient data partitioning.
64	Whare-map: heterogeneity in “homogeneous” warehouse-scale computers.
64	Design space exploration and optimization of path oblivious RAM in secure processors.
60	ArchShield: architectural framework for assisting DRAM scaling by tolerating high error rates.
60	A hardware evaluation of cache partitioning to improve utilization and energy-efficiency while preserving responsiveness.
58	Triggered instructions: a control paradigm for spatially-programmed architectures.
54	LINQits: big data on little clients.
52	SurfNoC: a low latency and provably non-interfering approach to secure networks-on-chip.
51	Tri-level-cell phase change memory: toward an efficient and reliable memory system.
50	Reducing memory access latency with asymmetric DRAM bank organizations.
49	Utility-based acceleration of multithreaded applications on asymmetric CMPs.
48	Continuous real-world inputs can open up alternative accelerator designs.
42	Understanding and mitigating refresh overheads in high-density DDR4 DRAM systems.
41	A new perspective for efficient virtual-cache coherence.
41	Flicker: a dynamically adaptive architecture for power limited multicore systems.
41	An energy-efficient and scalable eDRAM-based register file architecture for GPGPU.
39	AC-DIMM: associative computing with STT-MRAM.
38	Criticality stacks: identifying critical threads in parallel programs using synchronization behavior.
35	Improving memory scheduling via processor-side load criticality information.
33	Exploring memory consistency for massively-threaded throughput-oriented processors.
32	The locality-aware adaptive cache coherence protocol.
32	Resilient die-stacked DRAM caches.
31	Zombie memory: extending memory lifetime by reviving dead blocks.
30	Cooperative boosting: needy versus greedy power management.
29	Virtualizing power distribution in datacenters.
27	Agile, efficient virtualization power management with low-latency server power states.
26	DNA-based molecular architecture with spatially localized components.
25	Protozoa: adaptive granularity cache coherence.
25	QuickSAN: a storage area network for fast, distributed, solid state disks.
24	Studying multicore processor scaling via reuse distance analysis.
23	SIMD divergence optimization through intra-warp compaction.
23	Maximizing SIMD resource utilization in GPGPUs with SIMD lane permutation.
22	Bit mapping for balanced PCM cell programming.
22	Dynamic reduction of voltage margins by leveraging on-chip ECC in Itanium II processors.
21	WeeFence: toward making fences free in TSO.
19	QuickRec: prototyping an intel architecture extension for record and replay of multithreaded programs.
16	CPU transparent protection of OS kernel and hypervisor integrity with programmable DRAM.
14	STREX: boosting instruction cache reuse in OLTP workloads through stratified transaction execution.
14	Secure I/O device sharing among virtual machines on multiple hosts.
11	Microarchitectural mechanisms to exploit value structure in SIMT architectures.
10	Improving virtualization in the presence of software managed translation lookaside buffers.
7	Non-race concurrency bug detection through order-sensitive critical sections.
5	Quantum rotations: a case study in static and dynamic machine-code generation for quantum computers.
2	Deconfigurable microprocessor architectures for silicon debug acceleration.

2012¶

Cited by	Paper title
236	RAIDR: Retention-aware intelligent DRAM refresh.
205	Scheduling heterogeneous multi-cores through performance impact estimation (PIE).
145	Staged memory scheduling: Achieving high performance and scalability in heterogeneous systems.
144	Towards energy-proportional datacenter memory with mobile DRAM.
140	A case for exploiting subarray-level parallelism (SALP) in DRAM.
136	Scale-out processors.
118	Managing distributed UPS energy for effective power capping in data centers.
106	PreSET: Improving performance of phase change memories by exploiting asymmetry in write times.
82	A defect-tolerant accelerator for emerging high-performance applications.
80	iSwitch: Coordinating and optimizing renewable energy powered server clusters.
75	Can traditional programming bridge the Ninja performance gap for parallel computing applications?
74	Simultaneous branch and warp interweaving for sustained GPU performance.
73	TimeWarp: Rethinking timekeeping and performance monitoring mechanisms to mitigate side-channel attacks.
73	The Yin and Yang of power and performance for asymmetric hardware and managed software.
65	End-to-end sequential consistency.
64	A case for random shortcut topologies for HPC interconnects.
61	LOT-ECC: Localized and tiered reliability mechanisms for commodity memory systems.
59	Probabilistic Shared Cache Management (PriSM).
58	Branch regulation: Low-overhead protection from code reuse attacks.
56	Side-channel vulnerability factor: A metric for measuring information leakage.
52	Reducing memory reference energy with opportunistic virtual caching.
49	CAPRI: Prediction of compaction-adequacy for handling control-divergence in GPGPU architectures.
47	The dynamic granularity memory system.
46	Physically Addressed Queueing (PAQ): Improving parallelism in Solid State Disks.
45	iGPU: Exception support and speculative execution on GPUs.
42	RADISH: Always-on sound and complete race detection in software and hardware.
40	BOOM: Enabling mobile memory based low-power server DIMMs.
36	Buffer-on-board memory systems.
35	Watchdog: Hardware for safe and secure manual memory management and full memory safety.
32	Tolerating process variations in nanophotonic on-chip networks.
32	FLEXclusion: Balancing cache capacity and on-chip bandwidth via Flexible Exclusion.
29	Harmony: Collection and analysis of parallel block vectors.
29	PARDIS: A programmable memory controller for the DDRx interfacing standards.
27	Revisiting hardware-assisted page walks for virtualized systems.
27	A first-order mechanistic model for architectural vulnerability factor.
27	A micro-architectural analysis of switched photonic multi-chip interconnects.
25	VRSync: Characterizing and eliminating synchronization-induced voltage emergencies in many-core processors.
24	Boosting mobile GPU performance with a decoupled access/execute fragment processor.
23	Improving writeback efficiency with decoupled last-write prediction.
23	Lane decoupling for improving the timing-error resiliency of wide-SIMD architectures.
19	Inspection resistant memory: Architectural support for security from physical examination.
14	Viper: Virtual pipelines for enhanced reliability.
14	Enhancing effective throughput for transmission line-based bus.
8	Setting an error detection infrastructure with low cost acoustic wave detectors.
5	Euripus: A flexible unified hardware memory checkpointing accelerator for bidirectional-debugging and reliability.
5	Configurable fine-grain protection for multicore processor virtualization.
3	BlockChop: Dynamic squash elimination for hybrid processor architecture.

2011

Cited by	Paper title
1203	Dark silicon and the end of multicore scaling.
295	Power management of online data-intensive services.
175	Energy-efficient mechanisms for managing thread context in throughput processors.
172	The impact of memory subsystem resource sharing on datacenter applications.
160	Benefits and limitations of tapping into stored energy for datacenters.
157	Vantage: scalable and efficient fine-grain cache partitioning.
134	Kilo-NOC: a heterogeneous network-on-chip architecture for scalability and service guarantees.
132	Increasing the effectiveness of directory caches by deactivating coherence for private memory blocks.
120	FabScalar: composing synthesizable RTL designs of arbitrary cores within a canonical superscalar template.
114	DBAR: an efficient routing algorithm to support multiple concurrent applications in networks-on-chip.
111	Energy-efficient cache design using variable-strength error-correcting codes.
98	Scalable power control for many-core architectures running multi-threaded applications.
98	Prefetch-aware shared resource management for multi-core systems.
97	Exploring the tradeoffs between programmability and efficiency in data-parallel accelerators.
88	A case for heterogeneous on-chip interconnects for CMPs.
79	An abacus turn model for time/space-efficient reconfigurable routing.
74	Architecting on-chip interconnects for stacked 3D STT-RAM caches in CMPs.
74	Bypass and insertion algorithms for exclusive last-level caches.
69	Crafting a usable microkernel, processor, and I/O system with strict and provable information flow security.
66	Adaptive granularity memory systems: a tradeoff between storage efficiency and throughput.
66	The role of optics in future high radix switch design.
65	Combining memory and a controller with photonics through 3D-stacking to enable scalable and energy-efficient systems.
60	SRAM-DRAM hybrid memory with applications to efficient register files in fine-grained multi-threading.
59	Rapid identification of architectural bottlenecks via precise event counting.
52	SpecTLB: a mechanism for speculative address translation.
51	Sampling + DMR: practical and low-overhead permanent fault detection.
47	i-NVMM: a secure non-volatile main memory system with incremental encryption.
39	CPPC: correctable parity protected cache.
38	Automatic abstraction and fault tolerance in cortical microarchitectures.
36	TLSync: support for multiple fast barriers using on-chip transmission lines.
36	Fighting fire with fire: modeling the datacenter-scale effects of targeted superlattice thermal management.
34	Virtualizing performance asymmetric multi-core systems.
33	Demand-driven software race detection using hardware performance counters.
31	A case for globally shared-medium on-chip interconnect.
25	Rebound: scalable checkpointing for coherent shared memory.
21	Moguls: a model to explore the memory hierarchy for bandwidth improvements.
19	OUTRIDER: efficient memory latency tolerance with decoupled strands.
13	CRIB: consolidated rename, issue, and bypass.
7	FlexBulk: intelligently forming atomic blocks in blocked-execution multiprocessors to minimize squashes.
3	Releasing efficient beta cores to market early.

2010

Cited by	Paper title
756	Debunking the 100X GPU vs. CPU myth: an evaluation of throughput computing on CPU and GPU.
413	High performance cache replacement using re-reference interval prediction (RRIP).
389	An integrated GPU power and performance model.
361	Energy proportional datacenter networks.
318	Understanding sources of inefficiency in general-purpose chips.
254	NoHype: virtualized cloud infrastructure without the virtualization.
232	Use ECP, not ECC, for hard failures in resistive memories.
211	Rethinking DRAM design and organization for energy-constrained multi-cores.
195	Relax: an architectural framework for software recovery of hardware faults.
189	Security refresh: prevent malicious wear-out and increase durability for phase-change memory with dynamically randomized address mapping.
186	Dynamic warp subdivision for integrated branch and memory divergence tolerance.
173	Reducing cache power with low-cost, multi-bit error-correcting codes.
172	Web search using mobile cores: quantifying and mitigating the price of efficiency.
143	Morphable memory system: a robust architecture for exploiting multi-level phase change memories.
141	Resistive computation: avoiding the power wall with low-leakage, STT-MRAM based computing.
139	Aérgia: exploiting packet latency slack in on-chip networks.
124	A dynamically configurable coprocessor for convolutional neural networks.
124	Energy-performance tradeoffs in processor architecture and circuit design: a marginal cost analysis.
110	Translation caching: skip, don’t walk (the page table).
100	Re-architecting DRAM memory systems with monolithically integrated silicon photonics.
99	The impact of management operations on the virtualized datacenter.
99	SieveStore: a highly-selective, ensemble-level disk cache for cost-performance.
94	Modeling critical sections in Amdahl’s law and its implications for multicore design.
89	Conflict exceptions: simplifying concurrent language semantics with precise hardware exceptions for data-races.
88	The virtual write queue: coordinating DRAM and last-level cache policies.
86	Silicon-photonic network architectures for scalable, power-efficient multi-chip systems.
84	Evolution of thread-level parallelism in desktop applications.
77	An intra-chip free-space optical interconnect.
73	Thread tailor: dynamically weaving threads together for efficient, adaptive parallel applications.
68	Elastic cooperative caching: an autonomous dynamically adaptive memory hierarchy for chip multiprocessors.
65	A case for FAME: FPGA architecture model execution.
60	ColorSafe: architectural support for debugging and dynamically avoiding multi-variable atomicity violations.
55	Cohesion: a hybrid memory model for accelerators.
52	WiDGET: Wisconsin decoupled grid execution tiles.
42	Using hardware vulnerability factors to enhance AVF analysis.
39	Forwardflow: a scalable core for power-constrained CMPs.
39	RETCON: transactional repair without replay.
38	LReplay: a pending period based deterministic replay scheme.
31	Data marshaling for multi-core architectures.
29	Timetraveler: exploiting acyclic races for optimizing memory race recording.
20	Leveraging the core-level complementary effects of PVT variations to reduce timing emergencies in multi-core processors.
17	Necromancer: enhancing system throughput by animating dead cores.
15	Sentry: light-weight auxiliary memory access control.
13	The rebirth of neural networks.
7	Moving the needle, computer architecture research in academe and industry.
5	Improving the future by examining the past.
4	IVEC: off-chip memory integrity protection for both security and reliability.
2	Shared caches in multicores: the good, the bad, and the ugly.

2009

Cited by	Paper title
937	Scalable high performance main memory system using phase-change memory technology.
875	Architecting phase change memory as a scalable DRAM alternative.
644	A durable and energy efficient main memory using phase change memory technology.
532	An analytical model for a GPU architecture with memory-level and thread-level parallelism awareness.
380	Reactive NUCA: near-optimal block placement and replication in distributed caches.
337	Firefly: illuminating future network-on-chip with nanophotonics.
316	A case for bufferless routing in on-chip networks.
306	Hybrid cache architecture with disparate memory technologies.
267	PIPP: promotion/insertion pseudo-partitioning of multi-core shared caches.
238	Hardware support for WCET analysis of hard real-time multicore systems.
236	Scaling the bandwidth wall: challenges in and avenues for CMP scaling.
225	Thread motion: fine-grained power management for multi-core systems.
190	The performance of PC solid-state disks (SSDs) as a function of bandwidth, concurrency, device architecture, and system organization.
174	Temperature-constrained power control for chip multiprocessors with online model estimation.
170	Thread criticality predictors for dynamic performance, power, and resource management in chip multiprocessors.
169	Phastlane: a rapid transit optical routing network.
159	Rigel: an architecture and scalable programming interface for a 1000-core accelerator.
154	Disaggregated memory for expansion and sharing in blade servers.
137	Achieving predictable performance through better memory controller placement in many-core CMPs.
137	A case for an interleaving constrained shared-memory multi-processor.
124	Architectural core salvaging in a multi-core processor for hard-error tolerance.
106	Memory mapped ECC: low-cost error protection for last level caches.
100	SigRace: signature-based data race detection.
96	Spatio-temporal memory streaming.
94	AnySP: anytime anywhere anyway signal processing.
90	InvisiFence: performance-transparent memory ordering in conventional multiprocessors.
74	Application-aware deadlock-free oblivious routing.
70	Decoupled DIMM: building high-bandwidth memory system using low-speed DRAM devices.
66	Indirect adaptive routing on large scale interconnection networks.
62	Internet-scale service infrastructure efficiency.
60	Simultaneous speculative threading: a novel pipeline architecture implemented in Sun’s ROCK processor.
46	Stream chaining: exploiting multiple levels of correlation in data prefetching.
44	Multi-execution: multicore caching for data-similar executions.
41	Performance and power of cache-based reconfigurable computing.
40	A fault tolerant, area efficient architecture for Shor’s factoring algorithm.
40	Dynamic performance tuning for speculative threads.
36	ECMon: exposing cache events for monitoring.
29	Boosting single-thread performance in multi-core systems through fine-grain multi-threading.
25	Flexible reference-counting-based hardware acceleration for garbage collection.
25	A memory system design framework: creating smart memories.
21	Dynamic MIPS rate stabilization in out-of-order processors.
19	End-to-end register data-flow continuous self-test.
14	Ten ways to waste a parallel computer.
11	End-to-end performance forecasting: finding bottlenecks before they happen.
10	Decoupled store completion/silent deterministic replay: enabling scalable data memory for CPR/CFP processors.

2008

Cited by	Paper title
625	Corona: System Implications of Emerging Nanophotonic Technology.
588	3D-Stacked Memory Architectures for Multi-core Processors.
448	Parallelism-Aware Batch Scheduling: Enhancing both Performance and Fairness of Shared DRAM Systems.
367	Technology-Driven, Highly-Scalable Dragonfly Topology.
292	Variation-Aware Application Scheduling and Power Management for Chip Multiprocessors.
290	Improving NAND Flash Based Disk Caches.
288	Self-Optimizing Memory Controllers: A Reinforcement Learning Approach.
225	Trading off Cache Capacity for Reliability to Enable Low Voltage Operation.
222	Virtual Circuit Tree Multicasting: A Case for On-Chip Hardware Multicast Support.
221	MIRA: A Multi-layered On-Chip Interconnect Router Architecture.
217	DeLorean: Recording and Deterministically Replaying Shared-Memory Multiprocessor Execution Efficiently.
217	A Comprehensive Memory Modeling Tool and Its Application to the Design and Analysis of Future Memory Hierarchies.
216	Understanding and Designing New Server Architectures for Emerging Warehouse-Computing Environments.
179	Rerun: Exploiting Episodes for Lightweight Memory Race Recording.
160	Flexible Decoupled Transactional Memory Support.
150	Globally-Synchronized Frames for Guaranteed Quality-of-Service in On-Chip Networks.
134	TokenTM: Efficient Execution of Large Transactions with Hardware Transactional Memory.
130	Flexible Hardware Acceleration for Instruction-Grain Program Monitoring.
118	Atom-Aid: Detecting and Surviving Atomicity Violations.
113	Using Hardware Memory Protection to Build a High-Performance, Strongly-Atomic Hybrid Transactional Memory.
113	VEAL: Virtualized Execution Accelerator for Loops.
91	A Proactive Wearout Recovery Approach for Exploiting Microarchitectural Redundancy to Extend Cache SRAM Lifetime.
78	ReVIVaL: A Variation-Tolerant Architecture Using Voltage Interpolation and Variable Latency.
75	Online Estimation of Architectural Vulnerability Factor for Soft Errors.
71	iDEAL: Inter-router Dual-Function Energy and Area-Efficient Links for Network-on-Chip (NoC) Architectures.
69	Polymorphic On-Chip Networks.
45	Learning and Leveraging the Relationship between Architecture-Level Measurements and Individual User Satisfaction.
40	Achieving Out-of-Order Performance with Almost In-Order Complexity.
39	From Speculation to Security: Practical and Efficient Information Flow Tracking Using Speculative Hardware.
38	Atomic Vector Operations on Chip Multiprocessors.
37	Software-Controlled Priority Characterization of POWER5 Processor.
35	Running a Quantum Circuit at the Speed of Data.
33	Intra-disk Parallelism: An Idea Whose Time Has Come.
15	Counting Dependence Predictors.
15	Microcoded Architectures for Ion-Trap Quantum Computers.
9	A Two-Level Load/Store Queue Based on Execution Locality.
6	Fetch-Criticality Reduction through Control Independence.

2007

Cited by	Paper title
1553	Power provisioning for a warehouse-sized computer.
547	Adaptive insertion policies for high performance caching.
451	Anton, a special-purpose machine for molecular dynamics simulation.
368	Express virtual channels: towards the ideal interconnection fabric.
365	An effective hybrid transactional memory system with strong isolation guarantees.
363	Flattened butterfly: a cost-efficient topology for high-radix networks.
294	Core fusion: accommodating software diversity in chip multiprocessors.
285	Raksha: a flexible information flow architecture for software security.
265	A novel dimensionally-decomposed router for on-chip communication in 3D architectures.
263	New cache designs for thwarting software cache-based side channel attacks.
258	Performance pathologies in hardware transactional memory.
225	Carbon: architectural support for fine-grained parallelism on chip multiprocessors.
223	BulkSC: bulk enforcement of sequential consistency.
207	Virtual hierarchies to support server consolidation.
192	Configurable isolation: building high availability systems with commodity multi-core processors.
192	Virtual private caches.
162	Making the fast case common and the uncommon case simple in unbounded transactional memory.
157	Analysis of redundancy and application balance in the SPEC CPU2006 benchmark suite.
153	ReCycle: pipeline adaptation to tolerate process variation.
138	An integrated hardware-software approach to flexible transactional memory.
137	Limiting the power consumption of main memory.
127	Comparing memory systems for chip multiprocessors.
125	Dynamic prediction of architectural vulnerability from microarchitectural state.
109	Interconnect design considerations for large NUCA caches.
103	Examining ACE analysis reliability estimates using fault-injection.
99	Mechanisms for store-wait-free multiprocessors.
96	MetaTM/TxLinux: transactional memory for an operating system.
90	Synchronization state buffer: supporting efficient fine-grain synchronization on many-core architectures.
88	Rotary router: an efficient architecture for CMP interconnection networks.
82	Hardware atomicity for reliable software speculation.
76	Power model validation through thermal measurements.
72	A 64-bit stream processor architecture for scientific applications.
69	Automated design of application specific superscalar processors: an analytical approach.
64	Mechanisms for bounding vulnerabilities of processor structures.
57	Thermal modeling and management of DRAM memory systems.
48	Physical simulation for animation and visual effects: parallelization and characterization for chip multiprocessors.
47	ParallAX: an architecture for real-time physics.
45	Late-binding: enabling unordered load-store queues.
43	VPC prediction: reducing the cost of indirect branches via hardware-based dynamic devirtualization.
39	Matrix scheduler reloaded.
36	Transparent control independence (TCI).
29	Aquacore: a programmable architecture for microfluidics.
25	Ginger: control independence using tag rewriting.
13	Performance and security lessons learned from virtualizing the Alpha processor.
11	Tailoring quantum architectures to implementation style: a quantum computer for mobile and persistent qubits.
10	Architectural implications of brick and mortar silicon manufacturing.

2006

Cited by	Paper title
539	Techniques for Multicore Thermal Management: Classification and New Exploration.
477	Cooperative Caching for Chip Multiprocessors.
427	Design and Management of 3D Chip Multiprocessors Using Network-in-Memory.
376	Ensemble-level Power Management for Dense Blade Servers.
327	Bulk Disambiguation of Speculative Threads in Multiprocessors.
289	A Scalable Architecture For High-Throughput Regular-Expression Pattern Matching.
267	A Case for MLP-Aware Cache Replacement.
253	SODA: A Low-power Architecture For Software Radio.
229	A Gracefully Degrading and Energy-Efficient Modular Router Architecture for On-Chip Networks.
217	Architectural Semantics for Practical Transactional Memory.
208	The BlackWidow High-Radix Clos Network.
165	Spatial Memory Streaming.
139	Improving Cost, Performance, and Security of Memory Encryption and Authentication.
131	Interconnect-Aware Coherence Protocols for Chip Multiprocessors.
120	Learning-Based SMT Processor Resource Distribution via Hill-Climbing.
120	TRAP-Array: A Disk Array Architecture Providing Timely Recovery to Any Point-in-time.
70	Chisel: A Storage-efficient, Collision-free Hash-based Network Processing Architecture.
68	Program Demultiplexing: Data-flow based Speculative Parallelization of Methods in Sequential Programs.
62	Balanced Cache: Reducing Conflict Misses of Direct-Mapped Caches.
60	An Integrated Framework for Dependable and Revivable Architectures Using Multicore Processors.
59	Flexible Snooping: Adaptive Forwarding and Filtering of Snoops in Embedded-Ring Multiprocessors.
54	Memory Model = Instruction Reordering + Store Atomicity.
51	Quantum Memory Hierarchies: Efficient Designs to Match Available Parallelism in Quantum Computing.
44	Multiple Instruction Stream Processor.
42	Area-Performance Trade-offs in Tiled Dataflow Architectures.
39	Reducing Startup Time in Co-Designed Virtual Machines.
33	Tolerating Dependences Between Large Speculative Threads Via Sub-Threads.
30	Conditional Memory Ordering.
30	Interconnection Networks for Scalable Quantum Computers.
24	Slackened Memory Dependence Enforcement: Combining Opportunistic Forwarding with Decoupled Verification.
21	Distributed Arithmetic on a Quantum Multicomputer.
17	The Future of Virtualization Technology.
6	The End of Scaling? Revolutions in Technology and Microarchitecture as We Pass the 90 Nanometer Node.
1	Computer Architecture Research and Future Microprocessors: Where Do We Go from Here?
0	Message from the General Chair.
0	Message from the Program Chair.
0	SIGARCH Guidelines.

2005

Cited by	Paper title
610	Continuous Optimization.
511	Interconnections in Multi-Core Architectures: Understanding Mechanisms, Overheads and Scaling.
497	Virtualizing Transactional Memory.
411	Victim Replication: Maximizing Capacity while Hiding Wire Delay in Tiled Chip Multiprocessors.
341	BugNet: Continuously Recording Program Execution for Deterministic Replay Debugging.
341	Optimizing Replication, Communication, and Capacity Allocation in CMPs.
334	A High Throughput String Matching Architecture for Intrusion Detection and Prevention.
305	The Impact of Performance Asymmetry in Emerging Multicore Architectures.
263	Mitigating Amdahl’s Law through EPI Throttling.
232	Microarchitecture of a High-Radix Router.
230	Design and Implementation of the AEGIS Single-Chip Secure Processor Using Physical Random Functions.
227	Exploiting Structural Duplication for Lifetime Reliability Enhancement.
220	The V-Way Cache: Demand Based Associativity via Global Replacement.
215	Computing Architectural Vulnerability Factors for Address-Based Structures.
211	Architecture for Protecting Critical Secrets in Microprocessors.
195	Near-Optimal Worst-Case Throughput Routing for Two-Dimensional Mesh Networks.
195	An Ultra Low Power System Architecture for Sensor Network Applications.
191	RegionScout: Exploiting Coarse Grain Sharing in Snoop-Based Coherence.
178	An Architecture Framework for Transparent Instruction Set Customization in Embedded Processors.
150	Design and Evaluation of Hybrid Fault-Detection Systems.
148	Opportunistic Transient-Fault Detection.
148	Direct Cache Access for High Bandwidth Network I/O.
125	A Robust Main-Memory Compression Scheme.
124	Improving Multiprocessor Performance with Coarse-Grain Coherence Tracking.
122	Temporal Streaming of Shared Memory.
118	Analysis of the O-GEometric History Length Branch Predictor.
109	Energy Optimization of Subthreshold-Voltage Sensor Network Processors.
101	Adaptive Mechanisms and Policies for Managing Cache Hierarchies in Chip Multiprocessors.
99	High Efficiency Counter Mode Security Architecture via Prediction and Precomputation.
98	A Tree Based Router Search Engine Architecture with Single Port Memories.
95	Piecewise Linear Branch Prediction.
82	Disk Drive Roadmap from the Thermal Perspective: A Case for Dynamic Thermal Management.
80	Store Vulnerability Window (SVW): Re-Execution Filtering for Enhanced Load Optimization.
76	Rescue: A Microarchitecture for Testability and Defect Tolerance.
70	Scalable Load and Store Processing in Latency Tolerant Processors.
69	Techniques for Efficient Processing in Runahead Execution Engines.
64	An Integrated Memory Array Processor Architecture for Embedded Image Recognition Systems.
59	Deconstructing Commodity Storage Clusters.
53	An Evaluation Framework and Instruction Set Architecture for Ion-Trap Based Quantum Micro-Architectures.
46	RENO - A Rename-Based Instruction Optimizer.
44	Store Buffer Design in First-Level Multibanked Data Caches.
42	Improving Program Efficiency by Packing Instructions into Registers.
35	Dynamic Verification of Sequential Consistency.
9	Increased Scalability and Power Efficiency by Using Multiple Speed Pipelines.
7	Energy-Effectiveness of Pre-Execution and Energy-Aware P-Thread Selection.