MICRO

All

Cited by	Paper title	Year
1501	McPAT: an integrated power, area, and timing modeling framework for multicore and manycore architectures.	2009
941	Utility-Based Cache Partitioning: A Low-Overhead, High-Performance, Runtime Mechanism to Partition Shared Caches.	2006
603	An Analysis of Efficient Multi-Core Global Power Management Policies: Maximizing Performance for a Given Power Budget.	2006
552	Die Stacking (3D) Microarchitecture.	2006
470	Qilin: exploiting parallelism on heterogeneous multiprocessors with adaptive mapping.	2009
465	Enhancing lifetime and security of PCM-based main memory with start-gap wear leveling.	2009
449	Optimizing NUCA Organizations and Wiring Alternatives for Large Caches with CACTI 6.0.	2007
420	Stall-Time Fair Memory Access Scheduling for Chip Multiprocessors.	2007
370	LIFT: A Low-Overhead Practical Information Flow Tracking System for Detecting Security Attacks.	2006
365	Flattened Butterfly Topology for On-Chip Networks.	2007
358	Managing Distributed, Shared L2 Caches through OS-Level Page Allocation.	2006
358	Dynamic Warp Formation and Scheduling for Efficient GPU Control Flow.	2007
335	Characterizing flash memory: anomalies, observations, and applications.	2009
334	Fair Queuing Memory Systems.	2006
334	Flip-N-Write: a simple deterministic technique to improve PRAM write performance, energy and endurance.	2009
321	Neural Acceleration for General-Purpose Approximate Programs.	2012
302	Leveraging Optical Technology in Future Bus-based Chip Multiprocessors.	2006
299	ViChaR: A Dynamic Virtual Channel Regulator for Network-on-Chip Routers.	2006
296	Into the wild: studying real user activity patterns to guide power optimizations for mobile architectures.	2009
294	Thread Cluster Memory Scheduling: Exploiting Differences in Memory Access Behavior.	2010
285	Bubble-Up: increasing utilization in modern warehouse scale computers via sensible co-locations.	2011
282	Live, Runtime Phase Monitoring and Prediction on Real Systems with Application to Dynamic Power Management.	2006
278	Automatic Thread Extraction with Decoupled Software Pipelining.	2005
253	Improving GPU performance via large warps and two-level warp scheduling.	2011
247	Multi-bit Error Tolerant Caches Using Two-Dimensional Error Coding.	2007
241	Argus: Low-Cost, Comprehensive Error Detection in Simple Cores.	2007
237	ASR: Adaptive Selective Replication for CMP Caches.	2006
211	A Dynamic Compilation Framework for Controlling Microprocessor Energy and Performance.	2005
211	Coordinated management of multiple interacting resources in chip multiprocessors: A machine learning approach.	2008
210	Single-Chip Heterogeneous Computing: Does the Future Include Custom Logic, FPGAs, and GPGPUs?	2010
210	Moneta: A High-Performance Storage Array Architecture for Next-Generation, Non-volatile Memories.	2010
209	Architectural Support for Software Transactional Memory.	2006
206	Cache-Conscious Wavefront Scheduling.	2012
203	Mini-rank: Adaptive DRAM architecture for improving memory power efficiency.	2008
197	A Practical Approach to Exploiting Coarse-Grained Pipeline Parallelism in C Programs.	2007
192	Composable Lightweight Processors.	2007
185	Facelift: Hiding and slowing down aging in multicores.	2008
184	Penelope: The NBTI-Aware Processor.	2007
180	Distributed Microarchitectural Protocols in the TRIPS Prototype Processor.	2006
179	Revisiting the Sequential Programming Model for Multi-Core.	2007
172	Smart Refresh: An Enhanced Memory Controller Design for Reducing Energy in Conventional and 3D Die-Stacked DRAMs.	2007
170	Multi retention level STT-RAM cache designs with a dynamic refresh scheme.	2011
169	DaDianNao: A Machine-Learning Supercomputer.	2014
166	Reducing memory interference in multicore systems via application-aware memory channel partitioning.	2011
165	Application-aware prioritization mechanisms for on-chip networks.	2009
159	Reunion: Complexity-Effective Multicore Redundancy.	2006
157	FPGA-Accelerated Simulation Technologies (FAST): Fast, Full-System, Cycle-Accurate Simulators.	2007
156	Stream Programming on General-Purpose Processors.	2005
153	Prefetch-Aware DRAM Controllers.	2008
148	Low-cost router microarchitecture for on-chip networks.	2009
148	SCARAB: a single cycle adaptive routing and bufferless network.	2009
147	Cache bursts: A new approach for eliminating dead blocks and increasing cache efficiency.	2008
146	The ZCache: Decoupling Ways and Associativity.	2010
146	Pack&Cap: adaptive DVFS and thread packing under power caps.	2011
143	Understanding the Energy Consumption of Dynamic Random Access Memories.	2010
139	A tagless coherence directory.	2009
137	In-Network Cache Coherence.	2006
137	Copy or Discard execution model for speculative parallelization on multicores.	2008
136	Implementing Signatures for Transactional Memory.	2007
136	Approximate storage in solid-state memories.	2013
135	SAFER: Stuck-At-Fault Error Recovery for Memories.	2010
134	Improving cache lifetime reliability at ultra-low voltages.	2009
130	A Predictive Performance Model for Superscalar Processors.	2006
129	A novel cache architecture with enhanced performance and security.	2008
129	Coordinated control of multiple prefetchers in multi-core systems.	2009
129	SHiP: signature-based hit predictor for high performance caching.	2011
128	Mitigating the Impact of Process Variations on Processor Register Files and Execution Units.	2006
127	A Framework for Providing Quality of Service in Chip Multi-Processors.	2007
125	Transactional Memory Architecture and Implementation for IBM System Z.	2012
123	Efficiently enabling conventional block sizes for very large die-stacked DRAM caches.	2011
123	Fundamental Latency Trade-off in Architecting DRAM Caches: Outperforming Impractical SRAM-Tags with a Simple and Practical Design.	2012
120	Pseudo-LIFO: the foundation of a new family of replacement policies for last-level caches.	2009
118	EazyHTM: eager-lazy hardware transactional memory.	2009
118	Characterizing and mitigating the impact of process variations on phase change based memory systems.	2009
115	A Mechanism for Online Diagnosis of Hard Faults in Microprocessors.	2005
115	Comparing cache architectures and coherency protocols on x86-64 multicore SMP systems.	2009
113	Reducing the harmful effects of last-level cache polluters with an OS-level, software-only pollute buffer.	2008
113	Token flow control.	2008
113	Preemptive virtual clock: a flexible, efficient, and cost-effective QOS scheme for networks-on-chip.	2009
112	Mitigating Parameter Variation with Dynamic Fine-Grain Body Biasing.	2007
112	QsCores: trading dark silicon for scalable energy efficiency with quasi-specific cores.	2011
111	CoScale: Coordinating CPU and Memory System DVFS in Server Systems.	2012
111	SAGE: self-tuning approximation for graphics engines.	2013
110	Yield-Aware Cache Architectures.	2006
108	Process Variation Tolerant 3T1D-Based Cache Architectures.	2007
107	Software-Based Online Detection of Hardware Defects: Mechanisms, Architectural Support, and Evaluation.	2007
107	Quality programmable vector processors for approximate computing.	2013
106	Sampling Dead Block Prediction for Last-Level Caches.	2010
105	Light speed arbitration and flow control for nanophotonic interconnects.	2009
104	Dependence-aware transactional memory for increased concurrency.	2008
104	Parallel application memory scheduling.	2011
102	Self-calibrating Online Wearout Detection.	2007
99	Leveraging 3D Technology for Improved Reliability.	2007
99	Complexity effective memory access scheduling for many-core accelerator architectures.	2009
99	Composite Cores: Pushing Heterogeneity Into a Core.	2012
97	Dynamic Helper Threaded Prefetching on the Sun UltraSPARC CMP Processor.	2005
96	From SODA to scotch: The evolution of a wireless baseband processor.	2008
96	Task Superscalar: An Out-of-Order Task Pipeline.	2010
96	Active management of timing guardband to save energy in POWER7.	2011
94	A case for dynamic frequency tuning in on-chip networks.	2009
94	Many-Thread Aware Prefetching Mechanisms for GPGPU Applications.	2010
92	Adaptive Caches: Effective Shaping of Cache Behavior to Workloads.	2006
92	Using Address Independent Seed Encryption and Bonsai Merkle Trees to Make Secure Processors OS- and Performance-Friendly.	2007
91	The TM3270 Media-Processor.	2005
91	EVAL: Utilizing processors with variation-induced timing errors.	2008
90	Memory Prefetching Using Adaptive Stream Detection.	2006
90	Virtual tree coherence: Leveraging regions and in-network multicast trees for scalable cache coherence.	2008
90	Achieving Non-Inclusive Cache Performance with Inclusive Caches: Temporal Locality Aware (TLA) Cache Management Policies.	2010
89	Polymorphic pipeline array: a flexible multicore accelerator with virtualized execution for mobile multimedia applications.	2009
88	SD3: A Scalable Approach to Dynamic Data-Dependence Profiling.	2010
87	Minimalist open-page: a DRAM page-mode scheduling policy for the many-core era.	2011
86	Bundled execution of recurring traces for energy-efficient general purpose processing.	2011
85	Efficient unicast and multicast support for CMPs.	2008
84	The BubbleWrap many-core: popping cores for sequential acceleration.	2009
83	Elastic Refresh: Techniques to Mitigate Refresh Penalties in High Density Memory.	2010
82	Throughput-Effective On-Chip Networks for Manycore Accelerators.	2010
81	Finding concurrency bugs with context-aware communication graphs.	2009
80	The StageNet fabric for constructing resilient multicore systems.	2008
80	Meet the walkers: accelerating index traversals for in-memory databases.	2013
78	mSWAT: low-cost hardware fault detection and diagnosis for multicore systems.	2009
78	PACMan: prefetch-aware cache management for high performance caching.	2011
78	Kiln: closing the performance gap between systems with and without persistence support.	2013
77	A Quantum Logic Array Microarchitecture: Scalable Quantum Data Movement and Computation.	2005
77	Low Vccmin fault-tolerant cache with highly predictable performance.	2009
77	ZerehCache: armoring cache architectures in high defect density technologies.	2009
76	Exploiting Fine-Grained Data Parallelism with Chip Multiprocessors and Fast Barriers.	2006
75	Microarchitectural Design Space Exploration Using an Architecture-Centric Approach.	2007
75	Pay-As-You-Go: low-overhead hard-error correction for phase change memories.	2011
74	Extending the effectiveness of 3D-stacked DRAM caches with an adaptive multi-queue policy.	2009
73	Adaptive line placement with the set balancing cache.	2009
73	Improving Cache Management Policies Using Dynamic Reuse Distances.	2012
72	KnightShift: Scaling the Energy Proportionality Wall through Server-Level Heterogeneity.	2012
70	NoRD: Node-Router Decoupling for Effective Power-gating of On-Chip Routers.	2012
70	Unifying Primary Cache, Scratch, and Register File Memories in a Throughput Processor.	2012
69	Scalable Store-Load Forwarding via Store Queue Index Prediction.	2005
69	Improving memory bank-level parallelism in the presence of prefetching.	2009
69	Divergence-aware warp scheduling.	2013
68	RowClone: fast and energy-efficient in-DRAM bulk data copy and initialization.	2013
67	ASF: AMD64 Extension for Lock-Free Data Structures and Transactional Memory.	2010
67	SIMD re-convergence at thread frontiers.	2011
66	NoSQ: Store-Load Communication without a Store Queue.	2006
66	Power reduction of CMP communication networks via RF-interconnects.	2008
66	Kernel Weaver: Automatically Fusing Database Primitives for Efficient GPU Computation.	2012
64	The Cell Processor Architecture.	2005
63	Tribeca: design for PVT variations with local recovery and fine-grained adaptation.	2009
63	A Dynamically Adaptable Hardware Transactional Memory.	2010
62	A Framework for Coarse-Grain Optimizations in the On-Chip Memory Hierarchy.	2007
61	Shader Performance Analysis on a Modern GPU Architecture.	2005
61	Data Access Partitioning for Fine-grain Parallelism on Multicore Architectures.	2007
61	Preventing PCM banks from seizing too much power.	2011
60	Molecular Caches: A caching structure for dynamic creation of application-specific Heterogeneous cache regions.	2006
60	CPR: Composable performance regression for scalable multiprocessor models.	2008
60	Notary: Hardware techniques to enhance signatures.	2008
60	ESKIMO: Energy savings using Semantic Knowledge of Inconsequential Memory Occupancy for DRAM subsystem.	2009
60	Hardware transactional memory for GPU architectures.	2011
60	Predicting Performance Impact of DVFS for Realistic Memory Systems.	2012
59	Coherence Ordering for Ring-based Chip Multiprocessors.	2006
59	Scavenger: A New Last Level Cache Architecture with Global Block Priority.	2007
59	Power to the people: Leveraging human physiological traits to control microprocessor frequency.	2008
59	MorphCore: An Energy-Efficient Microarchitecture for High Performance ILP and High Throughput TLP.	2012
59	Heterogeneous system coherence for integrated CPU-GPU systems.	2013
58	Fire-and-Forget: Load/Store Scheduling with No Store Queue at All.	2006
58	Towards the ideal on-chip fabric for 1-to-many and many-to-1 communication.	2011
58	Architectural support for secure virtualization under a vulnerable hypervisor.	2011
57	Pinot: Speculative Multi-threading Processor Architecture Exploiting Parallelism over a Wide Range of Granularities.	2005
57	Fairness and Throughput in Switch on Event Multithreading.	2006
56	SHARP control: controlled shared cache management in chip multiprocessors.	2009
55	Thermal Management of On-Chip Caches Through Power Density Minimization.	2005
55	Proactive transaction scheduling for contention management.	2009
54	Portable compiler optimisation across embedded programs and microarchitectures using machine learning.	2009
54	NoCAlert: An On-Line and Real-Time Fault Detection Mechanism for Network-on-Chip Architectures.	2012
54	CoLT: Coalesced Large-Reach TLBs.	2012
53	A locality-aware memory hierarchy for energy-efficient GPU architectures.	2013
52	Wish Branches: Combining Conditional Branching and Predication for Adaptive Predicated Execution.	2005
52	Improving Region Selection in Dynamic Optimization Systems.	2005
52	Emulating Optimal Replacement with a Shepherd Cache.	2007
52	Online design bug detection: RTL analysis, flexible mechanisms, and evaluation.	2008
52	ReMAP: A Reconfigurable Heterogeneous Multicore Architecture.	2010
52	Spatiotemporal Coherence Tracking.	2012
51	Reconfigurable energy efficient near threshold cache architectures.	2008
51	Adaptive Cache Management for Energy-Efficient GPU Computing.	2014
50	Scalable Cache Miss Handling for High Memory-Level Parallelism.	2006
49	Address-Indexed Memory Disambiguation and Store-to-Load Forwarding.	2005
49	Tradeoffs in designing accelerator architectures for visual computing.	2008
49	Toward a multicore architecture for real-time ray-tracing.	2008
49	Execution leases: a hardware-supported mechanism for enforcing strong non-interference.	2009
49	In-network coherence filtering: snoopy coherence without broadcasts.	2009
49	BulkCompiler: high-performance sequential consistency through cooperative compiler and hardware support.	2009
49	Combating Aging with the Colt Duty Cycle Equalizer.	2010
49	A Predictive Model for Dynamic Microarchitectural Adaptivity Control.	2010
49	Unison Cache: A Scalable and Effective Die-Stacked DRAM Cache.	2014
48	ReSlice: Selective Re-Execution of Long-Retired Misspeculated Instructions Using Forward Slicing.	2005
48	CAPSULE: Hardware-Assisted Parallel Execution of Component-Based Programs.	2006
48	Voltage Smoothing: Characterizing and Mitigating Voltage Noise in Production Processors via Software-Guided Thread Scheduling.	2010
48	A Mostly-Clean DRAM Cache for Effective Hit Speculation and Self-Balancing Dispatch.	2012
47	Dynamic Standby Prediction for Leakage Tolerant Microprocessor Functional Units.	2006
47	NBTI tolerant microarchitecture design in the presence of process variation.	2008
47	Flexible and Efficient Instruction-Grained Run-Time Monitoring Using On-Chip Reconfigurable Fabric.	2010
47	A compile-time managed multi-level register file hierarchy.	2011
47	Linearly compressed pages: a low-complexity, low-latency main memory compression framework.	2013
45	Continuous Path and Edge Profiling.	2005
45	Token tenure: PATCHing token counting using directory-based cache coherence.	2008
45	Memory Latency Reduction via Thread Throttling.	2010
45	Fractal Coherence: Scalably Verifiable Cache Coherence.	2010
45	A new case for the TAGE branch predictor.	2011
45	FPB: Fine-grained Power Budgeting to Improve Write Throughput of Multi-level Cell Phase Change Memory.	2012
45	Decoupled compressed cache: exploiting spatial locality for energy-optimized compressed caching.	2013
44	Phoenix: Detecting and Recovering from Permanent Processor Design Bugs with Programmable Hardware.	2006
44	Temporal instruction fetch streaming.	2008
44	Amoeba-Cache: Adaptive Blocks for Eliminating Waste in the Memory Hierarchy.	2012
44	SMiTe: Precise QoS Prediction on Real-System SMT Processors to Improve Utilization in Warehouse Scale Computers.	2014
43	Reducing peak power with a table-driven adaptive processor core.	2009
42	"Flea-flicker" Multipass Pipelining: An Alternative to the High-Power Out-of-Order Offense.	2005
42	Offline symbolic analysis for multi-processor execution replay.	2009
42	Efficient Selection of Vector Instructions Using Dynamic Programming.	2010
42	A resistive TCAM accelerator for data-intensive computing.	2011
42	Rethinking DRAM Power Modes for Energy Proportionality.	2012
42	FIRM: Fair and High-Performance Memory Control for Persistent Memory Systems.	2014
41	Adaptive data compression for high-performance low-power on-chip networks.	2008
41	Low-power, high-performance analog neural branch prediction.	2008
41	An hybrid eDRAM/SRAM macrocell to implement first-level data caches.	2009
40	A Criticality Analysis of Clustering in Superscalar Processors.	2005
40	Optimizing shared cache behavior of chip multiprocessors.	2009
40	Multiple clock and voltage domains for chip multi processors.	2009
40	Probabilistic Distance-Based Arbitration: Providing Equality of Service for Many-Core CMPs.	2010
40	AtomTracker: A Comprehensive Approach to Atomic Region Inference and Violation Detection.	2010
40	Dataflow execution of sequential imperative programs on multicore architectures.	2011
40	Proactive instruction fetch.	2011
40	Managing GPU Concurrency in Heterogeneous Architectures.	2014
40	Load Value Approximation.	2014
39	NOC-Out: Microarchitecting a Scale-Out Processor.	2012
38	Cherry-MP: Correctly Integrating Checkpointed Early Resource Recycling in Chip Multiprocessors.	2005
38	Uncorq: Unconstrained Snoop Request Delivery in Embedded-Ring Multiprocessors.	2007
38	Adaptive Flow Control for Robust Performance and Energy.	2010
38	Register Cache System Not for Latency Reduction Purpose.	2010
38	Linearizing irregular memory accesses for improved correlated prefetching.	2013
37	Automatic Parallelization in a Binary Rewriter.	2010
37	Parichute: Generalized Turbocode-Based Error Correction for Near-Threshold Caches.	2010
37	Large-reach memory management unit caches.	2013
37	Multi-grain coherence directories.	2013
37	Iso-X: A Flexible Architecture for Hardware-Managed Isolated Execution.	2014
37	CAMEO: A Two-Level Memory Organization with Capacity of Main Memory and Flexibility of Hardware-Managed Cache.	2014
36	Dataflow Predication.	2006
36	Support for High-Frequency Streaming in CMPs.	2006
36	Impact of Cache Coherence Protocols on the Processing of Network Traffic.	2007
36	Architecting a chunk-based memory race recorder in modern CMPs.	2009
36	The application slowdown model: quantifying and controlling the impact of inter-application interference at shared caches and main memory.	2015
35	Address-Value Delta (AVD) Prediction: Increasing the Effectiveness of Runahead Execution by Exploiting Regular Memory Allocation Patterns.	2005
34	uComplexity: Estimating Processor Design Effort.	2005
34	Store Memory-Level Parallelism Optimizations for Commercial Applications.	2005
34	Low-Cost Epoch-Based Correlation Prefetching for Commercial Applications.	2007
34	Informed Microarchitecture Design Space Exploration Using Workload Dynamics.	2007
34	Guaranteeing Hits to Improve the Efficiency of a Small Instruction Cache.	2007
34	Hybrid analytical modeling of pending cache hits, data prefetching, and MSHRs.	2008
34	Encore: low-cost, fine-grained transient fault recovery.	2011
33	DDT: design and evaluation of a dynamic program analysis for optimizing data structure usage.	2009
33	PPEP: Online Performance, Power, and Energy Prediction Framework and DVFS Space Exploration.	2014
32	Characterizing the resource-sharing levels in the UltraSPARC T2 processor.	2009
32	Synergistic TLBs for High Performance Address Translation in Chip Multiprocessors.	2010
32	Packet chaining: efficient single-cycle allocation for on-chip networks.	2011
32	Accurate Fine-Grained Processor Power Proxies.	2012
32	Warped gates: gating aware scheduling and power gating for GPGPUs.	2013
31	LOFT: A High Performance Network-on-Chip Providing Quality-of-Service Support.	2010
31	System-level integrated server architectures for scale-out datacenters.	2011
31	Profiling Data-Dependence to Assist Parallelization: Framework, Scope, and Optimization.	2012
31	A Practical Methodology for Measuring the Side-Channel Signal Available to the Attacker for Instruction-Level Events.	2014
31	Random Fill Cache Architecture.	2014
30	Cost Sensitive Modulo Scheduling in a Loop Accelerator Synthesis System.	2005
30	PathExpander: Architectural Support for Increasing the Path Coverage of Dynamic Bug Detection.	2006
30	Leveraging Heterogeneity in DRAM Main Memories to Accelerate Critical Word Access.	2012
30	Quantifying the relationship between the power delivery network and architectural policies in a 3D-stacked memory device.	2013
30	Transparent Hardware Management of Stacked DRAM as Part of Memory.	2014
30	PORPLE: An Extensible Optimizer for Portable Data Placement on GPU.	2014
29	How to Fake 1000 Registers.	2005
29	Microarchitecture soft error vulnerability characterization and mitigation under 3D integration technology.	2008
29	Scalable Speculative Parallelization on Commodity Clusters.	2010
29	Vulcan: Hardware Support for Detecting Sequential Consistency Violations Dynamically.	2012
29	Efficient Memory Virtualization: Reducing Dimensionality of Nested Page Walks.	2014
28	Authentication Control Point and Its Implications For Secure Processor Design.	2006
28	The Art of Deception: Adaptive Precision Reduction for Area Efficient Physics Acceleration.	2007
28	A small cache of large ranges: Hardware methods for efficiently searching, storing, and updating big dataflow tags.	2008
28	A performance-correctness explicitly-decoupled architecture.	2008
28	Light64: lightweight hardware support for data race detection during systematic testing of parallel programs.	2009
27	Diverge-Merge Processor (DMP): Dynamic Predicated Execution of Complex Control-Flow Graphs Based on Frequently Executed Paths.	2006
27	Tolerating Concurrency Bugs Using Transactions as Lifeguards.	2010
27	Pseudo-Circuit: Accelerating Communication for On-Chip Interconnection Networks.	2010
27	Accelerating microprocessor silicon validation by exposing ISA diversity.	2011
27	CoreRacer: a practical memory race recorder for multicore x86 TSO processors.	2011
27	Formally enhanced runtime verification to ensure NoC functional correctness.	2011
27	Residue cache: a low-energy low-area L2 cache architecture via compression and partial hits.	2011
27	Designing a Programmable Wire-Speed Regular-Expression Matching Accelerator.	2012
27	Warped-DMR: Light-weight Error Detection for GPGPU.	2012
27	Protean Code: Achieving Near-Free Online Code Transformations for Warehouse Scale Computers.	2014
26	Merging Head and Tail Duplication for Convergent Hyperblock Formation.	2006
26	Shapeshifter: Dynamically changing pipeline width and speed to address process variations.	2008
26	Control flow obfuscation with information flow tracking.	2009
26	Ordering decoupled metadata accesses in multiprocessors.	2009
26	STEM: Spatiotemporal Management of Capacity for Intra-core Last Level Caches.	2010
26	Insertion and promotion for tree-based PseudoLRU last-level caches.	2013
26	Trace based phase prediction for tightly-coupled heterogeneous cores.	2013
26	Locality-Aware Mapping of Nested Parallel Patterns on GPUs.	2014
26	CC-Hunter: Uncovering Covert Timing Channels on Shared Processor Hardware.	2014
26	Gather-scatter DRAM: in-DRAM address translation to improve the spatial locality of non-unit strided accesses.	2015
25	Exploiting Vector Parallelism in Software Pipelined Loops.	2005
25	Variation-tolerant non-uniform 3D cache management in die stacked multicore processor.	2009
25	Dynamic Reconfiguration of 3D Photonic Networks-on-Chip for Maximizing Performance and Improving Fault Tolerance.	2012
25	AUDIT: Stress Testing the Automatic Way.	2012
25	The reuse cache: downsizing the shared last-level cache.	2013
25	Bi-Modal DRAM Cache: Improving Hit Rate, Hit Latency and Bandwidth.	2014
24	Effective Optimistic-Checker Tandem Core Design through Architectural Pruning.	2007
24	AVF Stressmark: Towards an Automated Methodology for Bounding the Worst-Case Vulnerability to Soft Errors.	2010
24	Minimal Multi-threading: Finding and Removing Redundant Instructions in Multi-threaded Processors.	2010
24	FeatherWeight: low-cost optical arbitration with QoS support.	2011
24	Enabling datacenter servers to scale out economically and sustainably.	2013
24	uDIREC: unified diagnosis and reconfiguration for frugal bypass of NoC faults.	2013
24	TLC: a tag-less cache for reducing dynamic first level cache energy.	2013
23	ScalableBulk: Scalable Cache Coherence for Atomic Blocks in a Lazy Environment.	2010
22	Adaptive and Speculative Slack Simulations of CMPs on CMPs.	2010
22	Hardware Support for Relaxed Concurrency Control in Transactional Memory.	2010
22	Idempotent processor architecture.	2011
22	Identifying and predicting timing-critical instructions to boost timing speculation.	2011
21	Balancing Resource Utilization to Mitigate Power Density in Processor Pipelines.	2005
21	A microarchitecture-based framework for pre- and post-silicon power delivery analysis.	2009
21	Addressing End-to-End Memory Access Latency in NoC-Based Multicores.	2012
20	A Floorplan-Aware Dynamic Inductive Noise Controller for Reliable Processor Design.	2006
20	Global Multi-Threaded Instruction Scheduling.	2007
20	Erasing Core Boundaries for Robust and Configurable Performance.	2010
20	RDIP: return-address-stack directed instruction prefetching.	2013
20	Crank it up or dial it down: coordinated multiprocessor frequency and folding control.	2013
20	Skewed Compressed Caches.	2014
20	ThyNVM: enabling software-transparent crash consistency in persistent memory systems.	2015
19	Doppelgänger: a cache for approximate computing.	2015
19	Verification of chip multiprocessor memory systems using a relaxed scoreboard.	2008
19	Implementing high availability memory with a duplication cache.	2008
19	Evaluating the effects of cache redundancy on profit.	2008
19	Architectural Support for Fair Reader-Writer Locking.	2010
19	The NoX router.	2011
19	Vector Extensions for Decision Support DBMS Acceleration.	2012
19	Systematic Energy Characterization of CMP/SMT Processor Systems via Automated Micro-Benchmarks.	2012
19	Aegis: partitioning data block for efficient recovery of stuck-at-faults in phase change memory.	2013
19	Use it or lose it: wear-out and lifetime in future chip multiprocessors.	2013
19	Calculating Architectural Vulnerability Factors for Spatial Multi-Bit Transient Faults.	2014
19	Enabling Realistic Fine-Grain Voltage Scaling with Reconfigurable Power Distribution Networks.	2014
19	Futility Scaling: High-Associativity Cache Partitioning.	2014
19	Equalizer: Dynamic Tuning of GPU Resources for Efficient Execution.	2014
18	DMDC: Delayed Memory Dependence Checking through Age-Based Filtering.	2006
18	Virtually Pipelined Network Memory.	2006
18	Strategies for mapping dataflow blocks to distributed hardware.	2008
18	Improving SIMT Efficiency of Global Rendering Algorithms with Architectural Support for Dynamic Micro-Kernels.	2010
18	Resilient microring resonator based photonic networks.	2011
18	Accelerating Irregular Algorithms on GPGPUs Using Fine-Grain Hardware Worklists.	2014
18	Neural acceleration for GPU throughput processors.	2015
18	Jump over ASLR: Attacking branch predictors to bypass ASLR.	2016
17	Manager-client pairing: a framework for implementing coherence hierarchies.	2011
16	Serialization-Aware Mini-Graphs: Performance with Fewer Resources.	2006
16	Time Interpolation: So Many Metrics, So Few Registers.	2007
16	Pipe Check: Specifying and Verifying Microarchitectural Enforcement of Memory Consistency Models.	2014
16	Harnessing Soft Computations for Low-Budget Fault Tolerance.	2014
16	Large pages and lightweight memory management in virtualized environments: can you have it both ways?	2015
15	Testudo: Heavyweight security analysis via statistical sampling.	2008
15	SHARK: Architectural support for autonomic protection against stealth by rootkit exploits.	2008
15	A systematic methodology to develop resilient cache coherence protocols.	2011
15	A data layout optimization framework for NUCA-based multicores.	2011
15	Inferred Models for Dynamic and Sparse Hardware-Software Spaces.	2012
15	Libra: Tailoring SIMD Execution Using Heterogeneous Hardware and Dynamic Configurability.	2012
15	BuMP: Bulk Memory Access Prediction and Streaming.	2014
14	CCICheck: using µhb graphs to verify the coherence-consistency interface.	2015
14	Reducing Instruction Fetch Cost by Packing Instructions into Register Windows.	2005
14	Optimal versus Heuristic Global Code Scheduling.	2007
14	InstantCheck: Checking the Determinism of Parallel Programs Using On-the-Fly Incremental Hashing.	2010
14	Virtual Snooping: Filtering Snoops in Virtualized Multi-cores.	2010
14	SLICC: Self-Assembly of Instruction Cache Collectives for OLTP Workloads.	2012
14	SHIFT: shared history instruction fetch for lean-core server processors.	2013
14	Voltage Noise in Multi-Core Processors: Empirical Characterization and Optimization Opportunities.	2014
14	PyMTL: A Unified Framework for Vertically Integrated Computer Architecture Research.	2014
13	Using a configurable processor generator for computer architecture prototyping.	2009
13	POWER7 multi-core processor design.	2009
13	Energy efficient GPU transactional memory via space-time optimizations.	2013
13	Imbalanced cache partitioning for balanced data-parallel programs.	2013
13	Citadel: Efficiently Protecting Stacked Memory from Large Granularity Failures.	2014
13	Multi-GPU System Design with Memory Networks.	2014
13	Arbitrary Modulus Indexing.	2014
13	Efficient persist barriers for multicores.	2015
12	A register-file approach for row buffer caches in die-stacked DRAMs.	2011
12	NoC Architectures for Silicon Interposer Systems: Why Pay for more Wires when you Can Get them (from your interposer) for Free?	2014
12	Exploring the Design Space of SPMD Divergence Management on Data-Parallel Architectures.	2014
12	Architectural Specialization for Inter-Iteration Loop Dependence Patterns.	2014
12	Free launch: optimizing GPU dynamic kernel launches through thread reuse.	2015
12	Enabling interposer-based disintegration of multi-core processors.	2015
11	Using Branch Correlation to Identify Infeasible Paths for Anomaly Detection.	2006
11	Memory Protection through Dynamic Access Control.	2006
11	Complementing user-level coarse-grain parallelism with implicit speculative parallelism.	2011
11	The Performance Vulnerability of Architectural and Non-architectural Arrays to Permanent Faults.	2012
11	DESC: energy-efficient data exchange using synchronized counters.	2013
11	Efficient multiprogramming for multicores with SCAF.	2013
11	B-Fetch: Branch Prediction Directed Prefetching for Chip-Multiprocessors.	2014
11	Cross-architecture performance prediction (XAPP) using CPU code to predict GPU performance.	2015
11	Efficiently prefetching complex address patterns.	2015
10	The Future Evolution of High-Performance Microprocessors.	2005
10	Efficient Use of Invisible Registers in Thumb Code.	2005
10	Tree register allocation.	2009
10	MLP-aware dynamic instruction window resizing for adaptively exploiting both ILP and MLP.	2013
10	Micro-Sliced Virtual Processors to Hide the Effect of Discontinuous CPU Availability for Consolidated Systems.	2014
10	Avoiding information leakage in the memory controller with fixed service policies.	2015
10	A scalable architecture for ordered parallelism.	2015
10	A cloud-scale acceleration architecture.	2016
9	A distributed processor state management architecture for large-window processors.	2008
9	ATDetector: improving the accuracy of a commercial data race detector by identifying address transfer.	2011
9	Predicting Coherence Communication by Tracking Synchronization Points at Run Time.	2012
9	Efficient management of last-level caches in graphics processors for 3D scene rendering workloads.	2013
9	Exploiting GPU peak-power and performance tradeoffs through reduced effective pipeline latency.	2013
9	Hi-Rise: A High-Radix Switch for 3D Integration with Single-Cycle Arbitration.	2014
9	RpStacks: Fast and Accurate Processor Design Space Exploration Using Representative Stall-Event Stacks.	2014
9	Exploiting commutativity to reduce the cost of updates to shared data in cache-coherent systems.	2015
9	Fast support for unstructured data processing: the unified automata processor.	2015
9	Enabling coordinated register allocation and thread-level parallelism optimization for GPUs.	2015
8	Wavelength stealing: an opportunistic approach to channel sharing in multi-chip photonic interconnects.	2013
8	Dodec: Random-Link, Low-Radix On-Chip Networks.	2014
8	Using ECC Feedback to Guide Voltage Speculation in Low-Voltage Processors.	2014
8	GPU register file virtualization.	2015
8	Neuromorphic accelerators: a comparison between neuroscience and machine-learning approaches.	2015
8	Coherence domain restriction on large scale systems.	2015
8	Efficient GPU synchronization without scopes: saying no to complex consistency models.	2015
8	Rubik: fast analytical power management for latency-critical systems.	2015
8	Delegated persist ordering.	2016
7	Control-Flow Decoupling.	2012
7	A Front-End Execution Architecture for High Energy Efficiency.	2014
7	Short-Circuiting Memory Traffic in Handheld Platforms.	2014
7	Execution Drafting: Energy Efficiency through Computation Deduplication.	2014
7	Improving DRAM latency with dynamic asymmetric subarray.	2015
7	The inner most loop iteration counter: a new dimension in branch history.	2015
7	TimeTrader: exploiting latency tail to save datacenter energy for online search.	2015
7	Fork path: improving efficiency of ORAM by removing redundant memory accesses.	2015
7	IMP: indirect memory prefetcher.	2015
7	Stripes: Bit-serial deep neural network computing.	2016
6	Incremental Commit Groups for Non-Atomic Trace Processing.	2005
6	Architecture-aware automatic computation offload for native applications.	2015
6	Border control: sandboxing accelerators.	2015
6	Microarchitectural implications of event-driven server-side web applications.	2015
6	Efficient warp execution in presence of divergence with collaborative context collection.	2015
6	Characterizing, modeling, and improving the QoE of mobile devices with low battery level.	2015
5	Data-Dependency Graph Transformations for Superblock Scheduling.	2006
5	TransCom: transforming stream communication for load balance and efficiency in networks-on-chip.	2011
5	Kernel Partitioning of Streaming Applications: A Statistical Approach to an NP-complete Problem.	2012
5	Compiler Support for Optimizing Memory Bank-Level Parallelism.	2014
5	Wormhole: Wisely Predicting Multidimensional Branches.	2014
5	Loop-Aware Memory Prefetching Using Code Block Working Sets.	2014
5	The CRISP performance model for dynamic voltage and frequency scaling in a GPGPU.	2015
5	An integrated concurrency and core-ISA architectural envelope definition, and test oracle, for IBM POWER multiprocessors.	2015
5	Prediction-guided performance-energy trade-off for interactive applications.	2015
5	Continuous runahead: Transparent hardware acceleration for memory intensive workloads.	2016
5	Approxilyzer: Towards a systematic framework for instruction-level approximate computing and its application to hardware resiliency.	2016
4	Why design must change: rethinking digital design.	2009
4	GPUMech: GPU Performance Modeling Technique Based on Interval Analysis.	2014
4	Safe limits on voltage reduction efficiency in GPUs: a direct measurement approach.	2015
4	A fast and accurate analytical technique to compute the AVF of sequential bits in a processor.	2015
4	Efficiently enforcing strong memory ordering in GPUs.	2015
4	Authenticache: harnessing cache ECC for system authentication.	2015
4	Execution time prediction for energy-efficient hardware accelerators.	2015
4	Low-cost soft error resilience with unified data verification and fine-grained recovery for acoustic sensor based detection.	2016
4	Graphicionado: A high-performance and energy-efficient accelerator for graph analytics.	2016
4	Co-designing accelerators and SoC interfaces using gem5-Aladdin.	2016
4	Improving bank-level parallelism for irregular applications.	2016
4	KLAP: Kernel launch aggregation and promotion for optimizing dynamic parallelism.	2016
3	SMARQ: Software-Managed Alias Register Queue for Dynamic Optimizations.	2012
3	Virtually-aged sampling DMR: unifying circuit failure prediction and circuit failure detection.	2013
3	DeSC: decoupled supply-compute communication management for heterogeneous architectures.	2015
3	HyComp: a hybrid cache compression method for selection of data-type-specific compression methods.	2015
3	Locking down insecure indirection with hardware-based control-data isolation.	2015
3	Modeling the implications of DRAM failures and protection techniques on datacenter TCO.	2015
3	More is less: improving the energy efficiency of data movement via opportunistic use of sparse codes.	2015
3	Fused-layer CNN accelerators.	2016
3	Towards efficient server architecture for virtualized network function deployment: Implications and implementations.	2016
3	Racer: TSO consistency via race detection.	2016
2	Architectures and algorithms for millisecond-scale molecular dynamics simulations of proteins.	2008
2	CRAM: coded registers for amplified multiporting.	2011
2	Allocating rotating registers by scheduling.	2013
2	Implicit-storing and redundant-encoding-of-attribute information in error-correction-codes.	2013
2	Specializing Compiler Optimizations through Programmable Composition for Dense Matrix Computations.	2014
2	Continuous, Low Overhead, Run-Time Validation of Program Executions.	2014
2	Bias-Free Branch Predictor.	2014
2	Bungee jumps: accelerating indirect branches through HW/SW co-design.	2015
2	Adaptive guardband scheduling to improve system-level efficiency of the POWER7+.	2015
2	MORC: a manycore-oriented compressed cache.	2015
2	CLEAN-ECC: high reliability ECC for adaptive granularity memory system.	2015
2	DynaMOS: dynamic schedule migration for heterogeneous cores.	2015
2	Self-contained, accurate precomputation prefetching.	2015
2	Confluence: unified instruction supply for scale-out servers.	2015
2	Filtered runahead execution with a runahead buffer.	2015
2	SABRes: Atomic object reads for in-memory rack-scale computing.	2016
2	Cambricon-X: An accelerator for sparse neural networks.	2016
2	Efficient kernel synthesis for performance portable programming.	2016
2	Chainsaw: Von-Neumann accelerators to leverage fused instruction chains.	2016
2	Bridging the I/O performance gap for big data workloads: A new NVDIMM-based approach.	2016
2	Spectral profiling: Observer-effect-free profiling by monitoring EM emanations.	2016
2	From high-level deep neural models to FPGAs.	2016
1	Microarchitecture in the system-level integration era.	2008
1	BulkCommit: scalable and fast commit of atomic blocks in a lazy multiprocessor environment.	2013
1	COMP: Compiler Optimizations for Manycore Processors.	2014
1	SAWS: synchronization aware GPGPU warp scheduling for multiple independent warp schedulers.	2015
1	vCache: architectural support for transparent and isolated virtual LLCs in virtualized environments.	2015
1	WarpPool: sharing requests with inter-warp coalescing for throughput processors.	2015
1	Enabling portable energy efficiency with memory accelerated library.	2015
1	DCS: a fast and scalable device-centric server architecture.	2015
1	Long term parking (LTP): criticality-aware resource allocation in OOO processors.	2015
1	A unified memory network architecture for in-memory computing in commodity servers.	2016
1	Path confidence based lookahead prefetching.	2016
1	Ti-states: Processor power management in the temperature inversion region.	2016
1	Cache-emulated register file: An integrated on-chip memory architecture for high performance GPGPUs.	2016
1	Chameleon: Versatile and practical near-DRAM acceleration architecture for large memory systems.	2016
1	An ultra low-power hardware accelerator for automatic speech recognition.	2016
1	HARE: Hardware accelerator for regular expressions.	2016
1	Evaluating programmable architectures for imaging and vision applications.	2016
1	Lazy release consistency for GPUs.	2016
1	Continuous shape shifting: Enabling loop co-optimization via near-free dynamic code rewriting.	2016
1	Quantifying and improving the efficiency of hardware-based mobile malware detectors.	2016
1	vDNN: Virtualized deep neural networks for scalable, memory-efficient neural network design.	2016
1	Perceptron learning for reuse prediction.	2016
1	NEUTRAMS: Neural network transformation and co-design under neuromorphic hardware constraints.	2016
1	C3D: Mitigating the NUMA bottleneck via coherent DRAM caches.	2016
1	OSCAR: Orchestrating STT-RAM cache traffic for heterogeneous CPU-GPU architectures.	2016
0	Message from the General Chairs.	2005
0	Message from the Program Co-Chairs.	2005
0	Control flow coalescing on a hybrid dataflow/von Neumann GPGPU.	2015
0	Ultra-low power render-based collision detection for CPU/GPU systems.	2015
0	Snatch: Opportunistically reassigning power allocation between processor and memory in 3D stacks.	2016
0	pTask: A smart prefetching scheme for OS intensive applications.	2016
0	MIMD synchronization on SIMT architectures.	2016
0	Redefining QoS and customizing the power management policy to satisfy individual mobile users.	2016
0	Contention-based congestion management in large-scale networks.	2016
0	PoisonIvy: Safe speculation for secure memory.	2016
0	The Bunker Cache for spatio-value approximation.	2016
0	Register sharing for equality prediction.	2016
0	CrystalBall: Statically analyzing runtime behavior via deep sequence learning.	2016
0	ReplayConfusion: Detecting cache-based covert channel attacks using record and replay.	2016
0	Dynamic error mitigation in NoCs using intelligent prediction techniques.	2016
0	Zorua: A holistic approach to resource virtualization in GPUs.	2016
0	A patch memory system for image processing and computer vision.	2016
0	Improving energy efficiency of DRAM by exploiting half page row access.	2016
0	Efficient data supply for hardware accelerators with prefetching and access/execute decoupling.	2016
0	The microarchitecture of a real-time robot motion planning accelerator.	2016
0	CANDY: Enabling coherent DRAM caches for multi-node systems.	2016
0	GRAPE: Minimizing energy for GPU applications with performance requirements.	2016
0	Exploiting semantic commutativity in hardware speculation.	2016
0	Dictionary sharing: An efficient cache compression scheme for compressed caches.	2016
0	Data-centric execution of speculative parallel programs.	2016
0	NeSC: Self-virtualizing nested storage controller.	2016
0	Reducing data movement energy via online data clustering and encoding.	2016
0	Keynotes: Internet of Things: History and hype, technology and policy.	2016
0	Concise loads and stores: The case for an asymmetric compute-memory architecture for approximation.	2016

2016¶

Cited by	Paper title
18	Jump over ASLR: Attacking branch predictors to bypass ASLR.
10	A cloud-scale acceleration architecture.
8	Delegated persist ordering.
7	Stripes: Bit-serial deep neural network computing.
5	Continuous runahead: Transparent hardware acceleration for memory intensive workloads.
5	Approxilyzer: Towards a systematic framework for instruction-level approximate computing and its application to hardware resiliency.
4	Low-cost soft error resilience with unified data verification and fine-grained recovery for acoustic sensor based detection.
4	Graphicionado: A high-performance and energy-efficient accelerator for graph analytics.
4	Co-designing accelerators and SoC interfaces using gem5-Aladdin.
4	Improving bank-level parallelism for irregular applications.
4	KLAP: Kernel launch aggregation and promotion for optimizing dynamic parallelism.
3	Fused-layer CNN accelerators.
3	Towards efficient server architecture for virtualized network function deployment: Implications and implementations.
3	Racer: TSO consistency via race detection.
2	SABRes: Atomic object reads for in-memory rack-scale computing.
2	Cambricon-X: An accelerator for sparse neural networks.
2	Efficient kernel synthesis for performance portable programming.
2	Chainsaw: Von-Neumann accelerators to leverage fused instruction chains.
2	Bridging the I/O performance gap for big data workloads: A new NVDIMM-based approach.
2	Spectral profiling: Observer-effect-free profiling by monitoring EM emanations.
2	From high-level deep neural models to FPGAs.
1	A unified memory network architecture for in-memory computing in commodity servers.
1	Path confidence based lookahead prefetching.
1	Ti-states: Processor power management in the temperature inversion region.
1	Cache-emulated register file: An integrated on-chip memory architecture for high performance GPGPUs.
1	Chameleon: Versatile and practical near-DRAM acceleration architecture for large memory systems.
1	An ultra low-power hardware accelerator for automatic speech recognition.
1	HARE: Hardware accelerator for regular expressions.
1	Evaluating programmable architectures for imaging and vision applications.
1	Lazy release consistency for GPUs.
1	Continuous shape shifting: Enabling loop co-optimization via near-free dynamic code rewriting.
1	Quantifying and improving the efficiency of hardware-based mobile malware detectors.
1	vDNN: Virtualized deep neural networks for scalable, memory-efficient neural network design.
1	Perceptron learning for reuse prediction.
1	NEUTRAMS: Neural network transformation and co-design under neuromorphic hardware constraints.
1	C3D: Mitigating the NUMA bottleneck via coherent DRAM caches.
1	OSCAR: Orchestrating STT-RAM cache traffic for heterogeneous CPU-GPU architectures.
0	Snatch: Opportunistically reassigning power allocation between processor and memory in 3D stacks.
0	pTask: A smart prefetching scheme for OS intensive applications.
0	MIMD synchronization on SIMT architectures.
0	Redefining QoS and customizing the power management policy to satisfy individual mobile users.
0	Contention-based congestion management in large-scale networks.
0	PoisonIvy: Safe speculation for secure memory.
0	The Bunker Cache for spatio-value approximation.
0	Register sharing for equality prediction.
0	CrystalBall: Statically analyzing runtime behavior via deep sequence learning.
0	ReplayConfusion: Detecting cache-based covert channel attacks using record and replay.
0	Dynamic error mitigation in NoCs using intelligent prediction techniques.
0	Zorua: A holistic approach to resource virtualization in GPUs.
0	A patch memory system for image processing and computer vision.
0	Improving energy efficiency of DRAM by exploiting half page row access.
0	Efficient data supply for hardware accelerators with prefetching and access/execute decoupling.
0	The microarchitecture of a real-time robot motion planning accelerator.
0	CANDY: Enabling coherent DRAM caches for multi-node systems.
0	GRAPE: Minimizing energy for GPU applications with performance requirements.
0	Exploiting semantic commutativity in hardware speculation.
0	Dictionary sharing: An efficient cache compression scheme for compressed caches.
0	Data-centric execution of speculative parallel programs.
0	NeSC: Self-virtualizing nested storage controller.
0	Reducing data movement energy via online data clustering and encoding.
0	Keynotes: Internet of Things: History and hype, technology and policy.
0	Concise loads and stores: The case for an asymmetric compute-memory architecture for approximation.

2015¶

Cited by	Paper title
36	The application slowdown model: quantifying and controlling the impact of inter-application interference at shared caches and main memory.
26	Gather-scatter DRAM: in-DRAM address translation to improve the spatial locality of non-unit strided accesses.
20	ThyNVM: enabling software-transparent crash consistency in persistent memory systems.
19	Doppelgänger: a cache for approximate computing.
18	Neural acceleration for GPU throughput processors.
16	Large pages and lightweight memory management in virtualized environments: can you have it both ways?
14	CCICheck: using µhb graphs to verify the coherence-consistency interface.
13	Efficient persist barriers for multicores.
12	Free launch: optimizing GPU dynamic kernel launches through thread reuse.
12	Enabling interposer-based disintegration of multi-core processors.
11	Cross-architecture performance prediction (XAPP) using CPU code to predict GPU performance.
11	Efficiently prefetching complex address patterns.
10	Avoiding information leakage in the memory controller with fixed service policies.
10	A scalable architecture for ordered parallelism.
9	Exploiting commutativity to reduce the cost of updates to shared data in cache-coherent systems.
9	Fast support for unstructured data processing: the unified automata processor.
9	Enabling coordinated register allocation and thread-level parallelism optimization for GPUs.
8	GPU register file virtualization.
8	Neuromorphic accelerators: a comparison between neuroscience and machine-learning approaches.
8	Coherence domain restriction on large scale systems.
8	Efficient GPU synchronization without scopes: saying no to complex consistency models.
8	Rubik: fast analytical power management for latency-critical systems.
7	Improving DRAM latency with dynamic asymmetric subarray.
7	The inner most loop iteration counter: a new dimension in branch history.
7	TimeTrader: exploiting latency tail to save datacenter energy for online search.
7	Fork path: improving efficiency of ORAM by removing redundant memory accesses.
7	IMP: indirect memory prefetcher.
6	Architecture-aware automatic computation offload for native applications.
6	Border control: sandboxing accelerators.
6	Microarchitectural implications of event-driven server-side web applications.
6	Efficient warp execution in presence of divergence with collaborative context collection.
6	Characterizing, modeling, and improving the QoE of mobile devices with low battery level.
5	The CRISP performance model for dynamic voltage and frequency scaling in a GPGPU.
5	An integrated concurrency and core-ISA architectural envelope definition, and test oracle, for IBM POWER multiprocessors.
5	Prediction-guided performance-energy trade-off for interactive applications.
4	Safe limits on voltage reduction efficiency in GPUs: a direct measurement approach.
4	A fast and accurate analytical technique to compute the AVF of sequential bits in a processor.
4	Efficiently enforcing strong memory ordering in GPUs.
4	Authenticache: harnessing cache ECC for system authentication.
4	Execution time prediction for energy-efficient hardware accelerators.
3	DeSC: decoupled supply-compute communication management for heterogeneous architectures.
3	HyComp: a hybrid cache compression method for selection of data-type-specific compression methods.
3	Locking down insecure indirection with hardware-based control-data isolation.
3	Modeling the implications of DRAM failures and protection techniques on datacenter TCO.
3	More is less: improving the energy efficiency of data movement via opportunistic use of sparse codes.
2	Bungee jumps: accelerating indirect branches through HW/SW co-design.
2	Adaptive guardband scheduling to improve system-level efficiency of the POWER7+.
2	MORC: a manycore-oriented compressed cache.
2	CLEAN-ECC: high reliability ECC for adaptive granularity memory system.
2	DynaMOS: dynamic schedule migration for heterogeneous cores.
2	Self-contained, accurate precomputation prefetching.
2	Confluence: unified instruction supply for scale-out servers.
2	Filtered runahead execution with a runahead buffer.
1	SAWS: synchronization aware GPGPU warp scheduling for multiple independent warp schedulers.
1	vCache: architectural support for transparent and isolated virtual LLCs in virtualized environments.
1	WarpPool: sharing requests with inter-warp coalescing for throughput processors.
1	Enabling portable energy efficiency with memory accelerated library.
1	DCS: a fast and scalable device-centric server architecture.
1	Long term parking (LTP): criticality-aware resource allocation in OOO processors.
0	Control flow coalescing on a hybrid dataflow/von Neumann GPGPU.
0	Ultra-low power render-based collision detection for CPU/GPU systems.

2014¶

Cited by	Paper title
169	DaDianNao: A Machine-Learning Supercomputer.
51	Adaptive Cache Management for Energy-Efficient GPU Computing.
49	Unison Cache: A Scalable and Effective Die-Stacked DRAM Cache.
44	SMiTe: Precise QoS Prediction on Real-System SMT Processors to Improve Utilization in Warehouse Scale Computers.
42	FIRM: Fair and High-Performance Memory Control for Persistent Memory Systems.
40	Managing GPU Concurrency in Heterogeneous Architectures.
40	Load Value Approximation.
37	Iso-X: A Flexible Architecture for Hardware-Managed Isolated Execution.
37	CAMEO: A Two-Level Memory Organization with Capacity of Main Memory and Flexibility of Hardware-Managed Cache.
33	PPEP: Online Performance, Power, and Energy Prediction Framework and DVFS Space Exploration.
31	A Practical Methodology for Measuring the Side-Channel Signal Available to the Attacker for Instruction-Level Events.
31	Random Fill Cache Architecture.
30	Transparent Hardware Management of Stacked DRAM as Part of Memory.
30	PORPLE: An Extensible Optimizer for Portable Data Placement on GPU.
29	Efficient Memory Virtualization: Reducing Dimensionality of Nested Page Walks.
27	Protean Code: Achieving Near-Free Online Code Transformations for Warehouse Scale Computers.
26	Locality-Aware Mapping of Nested Parallel Patterns on GPUs.
26	CC-Hunter: Uncovering Covert Timing Channels on Shared Processor Hardware.
25	Bi-Modal DRAM Cache: Improving Hit Rate, Hit Latency and Bandwidth.
20	Skewed Compressed Caches.
19	Calculating Architectural Vulnerability Factors for Spatial Multi-Bit Transient Faults.
19	Enabling Realistic Fine-Grain Voltage Scaling with Reconfigurable Power Distribution Networks.
19	Futility Scaling: High-Associativity Cache Partitioning.
19	Equalizer: Dynamic Tuning of GPU Resources for Efficient Execution.
18	Accelerating Irregular Algorithms on GPGPUs Using Fine-Grain Hardware Worklists.
16	PipeCheck: Specifying and Verifying Microarchitectural Enforcement of Memory Consistency Models.
16	Harnessing Soft Computations for Low-Budget Fault Tolerance.
15	BuMP: Bulk Memory Access Prediction and Streaming.
14	Voltage Noise in Multi-Core Processors: Empirical Characterization and Optimization Opportunities.
14	PyMTL: A Unified Framework for Vertically Integrated Computer Architecture Research.
13	Citadel: Efficiently Protecting Stacked Memory from Large Granularity Failures.
13	Multi-GPU System Design with Memory Networks.
13	Arbitrary Modulus Indexing.
12	NoC Architectures for Silicon Interposer Systems: Why Pay for more Wires when you Can Get them (from your interposer) for Free?
12	Exploring the Design Space of SPMD Divergence Management on Data-Parallel Architectures.
12	Architectural Specialization for Inter-Iteration Loop Dependence Patterns.
11	B-Fetch: Branch Prediction Directed Prefetching for Chip-Multiprocessors.
10	Micro-Sliced Virtual Processors to Hide the Effect of Discontinuous CPU Availability for Consolidated Systems.
9	Hi-Rise: A High-Radix Switch for 3D Integration with Single-Cycle Arbitration.
9	RpStacks: Fast and Accurate Processor Design Space Exploration Using Representative Stall-Event Stacks.
8	Dodec: Random-Link, Low-Radix On-Chip Networks.
8	Using ECC Feedback to Guide Voltage Speculation in Low-Voltage Processors.
7	A Front-End Execution Architecture for High Energy Efficiency.
7	Short-Circuiting Memory Traffic in Handheld Platforms.
7	Execution Drafting: Energy Efficiency through Computation Deduplication.
5	Compiler Support for Optimizing Memory Bank-Level Parallelism.
5	Wormhole: Wisely Predicting Multidimensional Branches.
5	Loop-Aware Memory Prefetching Using Code Block Working Sets.
4	GPUMech: GPU Performance Modeling Technique Based on Interval Analysis.
2	Specializing Compiler Optimizations through Programmable Composition for Dense Matrix Computations.
2	Continuous, Low Overhead, Run-Time Validation of Program Executions.
2	Bias-Free Branch Predictor.
1	COMP: Compiler Optimizations for Manycore Processors.

2013¶

Cited by	Paper title
136	Approximate storage in solid-state memories.
111	SAGE: self-tuning approximation for graphics engines.
107	Quality programmable vector processors for approximate computing.
80	Meet the walkers: accelerating index traversals for in-memory databases.
78	Kiln: closing the performance gap between systems with and without persistence support.
69	Divergence-aware warp scheduling.
68	RowClone: fast and energy-efficient in-DRAM bulk data copy and initialization.
59	Heterogeneous system coherence for integrated CPU-GPU systems.
53	A locality-aware memory hierarchy for energy-efficient GPU architectures.
47	Linearly compressed pages: a low-complexity, low-latency main memory compression framework.
45	Decoupled compressed cache: exploiting spatial locality for energy-optimized compressed caching.
38	Linearizing irregular memory accesses for improved correlated prefetching.
37	Large-reach memory management unit caches.
37	Multi-grain coherence directories.
32	Warped gates: gating aware scheduling and power gating for GPGPUs.
30	Quantifying the relationship between the power delivery network and architectural policies in a 3D-stacked memory device.
26	Insertion and promotion for tree-based PseudoLRU last-level caches.
26	Trace based phase prediction for tightly-coupled heterogeneous cores.
25	The reuse cache: downsizing the shared last-level cache.
24	Enabling datacenter servers to scale out economically and sustainably.
24	uDIREC: unified diagnosis and reconfiguration for frugal bypass of NoC faults.
24	TLC: a tag-less cache for reducing dynamic first level cache energy.
20	RDIP: return-address-stack directed instruction prefetching.
20	Crank it up or dial it down: coordinated multiprocessor frequency and folding control.
19	Aegis: partitioning data block for efficient recovery of stuck-at-faults in phase change memory.
19	Use it or lose it: wear-out and lifetime in future chip multiprocessors.
14	SHIFT: shared history instruction fetch for lean-core server processors.
13	Energy efficient GPU transactional memory via space-time optimizations.
13	Imbalanced cache partitioning for balanced data-parallel programs.
11	DESC: energy-efficient data exchange using synchronized counters.
11	Efficient multiprogramming for multicores with SCAF.
10	MLP-aware dynamic instruction window resizing for adaptively exploiting both ILP and MLP.
9	Efficient management of last-level caches in graphics processors for 3D scene rendering workloads.
9	Exploiting GPU peak-power and performance tradeoffs through reduced effective pipeline latency.
8	Wavelength stealing: an opportunistic approach to channel sharing in multi-chip photonic interconnects.
3	Virtually-aged sampling DMR: unifying circuit failure prediction and circuit failure detection.
2	Allocating rotating registers by scheduling.
2	Implicit-storing and redundant-encoding-of-attribute information in error-correction-codes.
1	BulkCommit: scalable and fast commit of atomic blocks in a lazy multiprocessor environment.

2012¶

Cited by	Paper title
321	Neural Acceleration for General-Purpose Approximate Programs.
206	Cache-Conscious Wavefront Scheduling.
125	Transactional Memory Architecture and Implementation for IBM System Z.
123	Fundamental Latency Trade-off in Architecting DRAM Caches: Outperforming Impractical SRAM-Tags with a Simple and Practical Design.
111	CoScale: Coordinating CPU and Memory System DVFS in Server Systems.
99	Composite Cores: Pushing Heterogeneity Into a Core.
73	Improving Cache Management Policies Using Dynamic Reuse Distances.
72	KnightShift: Scaling the Energy Proportionality Wall through Server-Level Heterogeneity.
70	NoRD: Node-Router Decoupling for Effective Power-gating of On-Chip Routers.
70	Unifying Primary Cache, Scratch, and Register File Memories in a Throughput Processor.
66	Kernel Weaver: Automatically Fusing Database Primitives for Efficient GPU Computation.
60	Predicting Performance Impact of DVFS for Realistic Memory Systems.
59	MorphCore: An Energy-Efficient Microarchitecture for High Performance ILP and High Throughput TLP.
54	NoCAlert: An On-Line and Real-Time Fault Detection Mechanism for Network-on-Chip Architectures.
54	CoLT: Coalesced Large-Reach TLBs.
52	Spatiotemporal Coherence Tracking.
48	A Mostly-Clean DRAM Cache for Effective Hit Speculation and Self-Balancing Dispatch.
45	FPB: Fine-grained Power Budgeting to Improve Write Throughput of Multi-level Cell Phase Change Memory.
44	Amoeba-Cache: Adaptive Blocks for Eliminating Waste in the Memory Hierarchy.
42	Rethinking DRAM Power Modes for Energy Proportionality.
39	NOC-Out: Microarchitecting a Scale-Out Processor.
32	Accurate Fine-Grained Processor Power Proxies.
31	Profiling Data-Dependence to Assist Parallelization: Framework, Scope, and Optimization.
30	Leveraging Heterogeneity in DRAM Main Memories to Accelerate Critical Word Access.
29	Vulcan: Hardware Support for Detecting Sequential Consistency Violations Dynamically.
27	Designing a Programmable Wire-Speed Regular-Expression Matching Accelerator.
27	Warped-DMR: Light-weight Error Detection for GPGPU.
25	Dynamic Reconfiguration of 3D Photonic Networks-on-Chip for Maximizing Performance and Improving Fault Tolerance.
25	AUDIT: Stress Testing the Automatic Way.
21	Addressing End-to-End Memory Access Latency in NoC-Based Multicores.
19	Vector Extensions for Decision Support DBMS Acceleration.
19	Systematic Energy Characterization of CMP/SMT Processor Systems via Automated Micro-Benchmarks.
15	Inferred Models for Dynamic and Sparse Hardware-Software Spaces.
15	Libra: Tailoring SIMD Execution Using Heterogeneous Hardware and Dynamic Configurability.
14	SLICC: Self-Assembly of Instruction Cache Collectives for OLTP Workloads.
11	The Performance Vulnerability of Architectural and Non-architectural Arrays to Permanent Faults.
9	Predicting Coherence Communication by Tracking Synchronization Points at Run Time.
7	Control-Flow Decoupling.
5	Kernel Partitioning of Streaming Applications: A Statistical Approach to an NP-complete Problem.
3	SMARQ: Software-Managed Alias Register Queue for Dynamic Optimizations.

2011¶

Cited by	Paper title
285	Bubble-Up: increasing utilization in modern warehouse scale computers via sensible co-locations.
253	Improving GPU performance via large warps and two-level warp scheduling.
170	Multi retention level STT-RAM cache designs with a dynamic refresh scheme.
166	Reducing memory interference in multicore systems via application-aware memory channel partitioning.
146	Pack&Cap: adaptive DVFS and thread packing under power caps.
129	SHiP: signature-based hit predictor for high performance caching.
123	Efficiently enabling conventional block sizes for very large die-stacked DRAM caches.
112	QsCores: trading dark silicon for scalable energy efficiency with quasi-specific cores.
104	Parallel application memory scheduling.
96	Active management of timing guardband to save energy in POWER7.
87	Minimalist open-page: a DRAM page-mode scheduling policy for the many-core era.
86	Bundled execution of recurring traces for energy-efficient general purpose processing.
78	PACMan: prefetch-aware cache management for high performance caching.
75	Pay-As-You-Go: low-overhead hard-error correction for phase change memories.
67	SIMD re-convergence at thread frontiers.
61	Preventing PCM banks from seizing too much power.
60	Hardware transactional memory for GPU architectures.
58	Towards the ideal on-chip fabric for 1-to-many and many-to-1 communication.
58	Architectural support for secure virtualization under a vulnerable hypervisor.
47	A compile-time managed multi-level register file hierarchy.
45	A new case for the TAGE branch predictor.
42	A resistive TCAM accelerator for data-intensive computing.
40	Dataflow execution of sequential imperative programs on multicore architectures.
40	Proactive instruction fetch.
34	Encore: low-cost, fine-grained transient fault recovery.
32	Packet chaining: efficient single-cycle allocation for on-chip networks.
31	System-level integrated server architectures for scale-out datacenters.
27	Accelerating microprocessor silicon validation by exposing ISA diversity.
27	CoreRacer: a practical memory race recorder for multicore x86 TSO processors.
27	Formally enhanced runtime verification to ensure NoC functional correctness.
27	Residue cache: a low-energy low-area L2 cache architecture via compression and partial hits.
24	FeatherWeight: low-cost optical arbitration with QoS support.
22	Idempotent processor architecture.
22	Identifying and predicting timing-critical instructions to boost timing speculation.
19	The NoX router.
18	Resilient microring resonator based photonic networks.
17	Manager-client pairing: a framework for implementing coherence hierarchies.
15	A systematic methodology to develop resilient cache coherence protocols.
15	A data layout optimization framework for NUCA-based multicores.
12	A register-file approach for row buffer caches in die-stacked DRAMs.
11	Complementing user-level coarse-grain parallelism with implicit speculative parallelism.
9	ATDetector: improving the accuracy of a commercial data race detector by identifying address transfer.
5	TransCom: transforming stream communication for load balance and efficiency in networks-on-chip.
2	CRAM: coded registers for amplified multiporting.

2010¶

Cited by	Paper title
294	Thread Cluster Memory Scheduling: Exploiting Differences in Memory Access Behavior.
210	Single-Chip Heterogeneous Computing: Does the Future Include Custom Logic, FPGAs, and GPGPUs?
210	Moneta: A High-Performance Storage Array Architecture for Next-Generation, Non-volatile Memories.
146	The ZCache: Decoupling Ways and Associativity.
143	Understanding the Energy Consumption of Dynamic Random Access Memories.
135	SAFER: Stuck-At-Fault Error Recovery for Memories.
106	Sampling Dead Block Prediction for Last-Level Caches.
96	Task Superscalar: An Out-of-Order Task Pipeline.
94	Many-Thread Aware Prefetching Mechanisms for GPGPU Applications.
90	Achieving Non-Inclusive Cache Performance with Inclusive Caches: Temporal Locality Aware (TLA) Cache Management Policies.
88	SD3: A Scalable Approach to Dynamic Data-Dependence Profiling.
83	Elastic Refresh: Techniques to Mitigate Refresh Penalties in High Density Memory.
82	Throughput-Effective On-Chip Networks for Manycore Accelerators.
67	ASF: AMD64 Extension for Lock-Free Data Structures and Transactional Memory.
63	A Dynamically Adaptable Hardware Transactional Memory.
52	ReMAP: A Reconfigurable Heterogeneous Multicore Architecture.
49	Combating Aging with the Colt Duty Cycle Equalizer.
49	A Predictive Model for Dynamic Microarchitectural Adaptivity Control.
48	Voltage Smoothing: Characterizing and Mitigating Voltage Noise in Production Processors via Software-Guided Thread Scheduling.
47	Flexible and Efficient Instruction-Grained Run-Time Monitoring Using On-Chip Reconfigurable Fabric.
45	Memory Latency Reduction via Thread Throttling.
45	Fractal Coherence: Scalably Verifiable Cache Coherence.
42	Efficient Selection of Vector Instructions Using Dynamic Programming.
40	Probabilistic Distance-Based Arbitration: Providing Equality of Service for Many-Core CMPs.
40	AtomTracker: A Comprehensive Approach to Atomic Region Inference and Violation Detection.
38	Adaptive Flow Control for Robust Performance and Energy.
38	Register Cache System Not for Latency Reduction Purpose.
37	Automatic Parallelization in a Binary Rewriter.
37	Parichute: Generalized Turbocode-Based Error Correction for Near-Threshold Caches.
32	Synergistic TLBs for High Performance Address Translation in Chip Multiprocessors.
31	LOFT: A High Performance Network-on-Chip Providing Quality-of-Service Support.
29	Scalable Speculative Parallelization on Commodity Clusters.
27	Tolerating Concurrency Bugs Using Transactions as Lifeguards.
27	Pseudo-Circuit: Accelerating Communication for On-Chip Interconnection Networks.
26	STEM: Spatiotemporal Management of Capacity for Intra-core Last Level Caches.
24	AVF Stressmark: Towards an Automated Methodology for Bounding the Worst-Case Vulnerability to Soft Errors.
24	Minimal Multi-threading: Finding and Removing Redundant Instructions in Multi-threaded Processors.
23	ScalableBulk: Scalable Cache Coherence for Atomic Blocks in a Lazy Environment.
22	Adaptive and Speculative Slack Simulations of CMPs on CMPs.
22	Hardware Support for Relaxed Concurrency Control in Transactional Memory.
20	Erasing Core Boundaries for Robust and Configurable Performance.
19	Architectural Support for Fair Reader-Writer Locking.
18	Improving SIMT Efficiency of Global Rendering Algorithms with Architectural Support for Dynamic Micro-Kernels.
14	InstantCheck: Checking the Determinism of Parallel Programs Using On-the-Fly Incremental Hashing.
14	Virtual Snooping: Filtering Snoops in Virtualized Multi-cores.

2009¶

Cited by	Paper title
1501	McPAT: an integrated power, area, and timing modeling framework for multicore and manycore architectures.
470	Qilin: exploiting parallelism on heterogeneous multiprocessors with adaptive mapping.
465	Enhancing lifetime and security of PCM-based main memory with start-gap wear leveling.
335	Characterizing flash memory: anomalies, observations, and applications.
334	Flip-N-Write: a simple deterministic technique to improve PRAM write performance, energy and endurance.
296	Into the wild: studying real user activity patterns to guide power optimizations for mobile architectures.
165	Application-aware prioritization mechanisms for on-chip networks.
148	Low-cost router microarchitecture for on-chip networks.
148	SCARAB: a single cycle adaptive routing and bufferless network.
139	A tagless coherence directory.
134	Improving cache lifetime reliability at ultra-low voltages.
129	Coordinated control of multiple prefetchers in multi-core systems.
120	Pseudo-LIFO: the foundation of a new family of replacement policies for last-level caches.
118	EazyHTM: eager-lazy hardware transactional memory.
118	Characterizing and mitigating the impact of process variations on phase change based memory systems.
115	Comparing cache architectures and coherency protocols on x86-64 multicore SMP systems.
113	Preemptive virtual clock: a flexible, efficient, and cost-effective QOS scheme for networks-on-chip.
105	Light speed arbitration and flow control for nanophotonic interconnects.
99	Complexity effective memory access scheduling for many-core accelerator architectures.
94	A case for dynamic frequency tuning in on-chip networks.
89	Polymorphic pipeline array: a flexible multicore accelerator with virtualized execution for mobile multimedia applications.
84	The BubbleWrap many-core: popping cores for sequential acceleration.
81	Finding concurrency bugs with context-aware communication graphs.
78	mSWAT: low-cost hardware fault detection and diagnosis for multicore systems.
77	Low Vccmin fault-tolerant cache with highly predictable performance.
77	ZerehCache: armoring cache architectures in high defect density technologies.
74	Extending the effectiveness of 3D-stacked DRAM caches with an adaptive multi-queue policy.
73	Adaptive line placement with the set balancing cache.
69	Improving memory bank-level parallelism in the presence of prefetching.
63	Tribeca: design for PVT variations with local recovery and fine-grained adaptation.
60	ESKIMO: Energy savings using Semantic Knowledge of Inconsequential Memory Occupancy for DRAM subsystem.
56	SHARP control: controlled shared cache management in chip multiprocessors.
55	Proactive transaction scheduling for contention management.
54	Portable compiler optimisation across embedded programs and microarchitectures using machine learning.
49	Execution leases: a hardware-supported mechanism for enforcing strong non-interference.
49	In-network coherence filtering: snoopy coherence without broadcasts.
49	BulkCompiler: high-performance sequential consistency through cooperative compiler and hardware support.
43	Reducing peak power with a table-driven adaptive processor core.
42	Offline symbolic analysis for multi-processor execution replay.
41	An hybrid eDRAM/SRAM macrocell to implement first-level data caches.
40	Optimizing shared cache behavior of chip multiprocessors.
40	Multiple clock and voltage domains for chip multi processors.
36	Architecting a chunk-based memory race recorder in modern CMPs.
33	DDT: design and evaluation of a dynamic program analysis for optimizing data structure usage.
32	Characterizing the resource-sharing levels in the UltraSPARC T2 processor.
28	Light64: lightweight hardware support for data race detection during systematic testing of parallel programs.
26	Control flow obfuscation with information flow tracking.
26	Ordering decoupled metadata accesses in multiprocessors.
25	Variation-tolerant non-uniform 3D cache management in die stacked multicore processor.
21	A microarchitecture-based framework for pre- and post-silicon power delivery analysis.
13	Using a configurable processor generator for computer architecture prototyping.
13	POWER7 multi-core processor design.
10	Tree register allocation.
4	Why design must change: rethinking digital design.

2008¶

Cited by	Paper title
211	Coordinated management of multiple interacting resources in chip multiprocessors: A machine learning approach.
203	Mini-rank: Adaptive DRAM architecture for improving memory power efficiency.
185	Facelift: Hiding and slowing down aging in multicores.
153	Prefetch-Aware DRAM Controllers.
147	Cache bursts: A new approach for eliminating dead blocks and increasing cache efficiency.
137	Copy or Discard execution model for speculative parallelization on multicores.
129	A novel cache architecture with enhanced performance and security.
113	Reducing the harmful effects of last-level cache polluters with an OS-level, software-only pollute buffer.
113	Token flow control.
104	Dependence-aware transactional memory for increased concurrency.
96	From SODA to scotch: The evolution of a wireless baseband processor.
91	EVAL: Utilizing processors with variation-induced timing errors.
90	Virtual tree coherence: Leveraging regions and in-network multicast trees for scalable cache coherence.
85	Efficient unicast and multicast support for CMPs.
80	The StageNet fabric for constructing resilient multicore systems.
66	Power reduction of CMP communication networks via RF-interconnects.
60	CPR: Composable performance regression for scalable multiprocessor models.
60	Notary: Hardware techniques to enhance signatures.
59	Power to the people: Leveraging human physiological traits to control microprocessor frequency.
52	Online design bug detection: RTL analysis, flexible mechanisms, and evaluation.
51	Reconfigurable energy efficient near threshold cache architectures.
49	Tradeoffs in designing accelerator architectures for visual computing.
49	Toward a multicore architecture for real-time ray-tracing.
47	NBTI tolerant microarchitecture design in the presence of process variation.
45	Token tenure: PATCHing token counting using directory-based cache coherence.
44	Temporal instruction fetch streaming.
41	Adaptive data compression for high-performance low-power on-chip networks.
41	Low-power, high-performance analog neural branch prediction.
34	Hybrid analytical modeling of pending cache hits, data prefetching, and MSHRs.
29	Microarchitecture soft error vulnerability characterization and mitigation under 3D integration technology.
28	A small cache of large ranges: Hardware methods for efficiently searching, storing, and updating big dataflow tags.
28	A performance-correctness explicitly-decoupled architecture.
26	Shapeshifter: Dynamically changing pipeline width and speed to address process variations.
19	Verification of chip multiprocessor memory systems using a relaxed scoreboard.
19	Implementing high availability memory with a duplication cache.
19	Evaluating the effects of cache redundancy on profit.
18	Strategies for mapping dataflow blocks to distributed hardware.
15	Testudo: Heavyweight security analysis via statistical sampling.
15	SHARK: Architectural support for autonomic protection against stealth by rootkit exploits.
9	A distributed processor state management architecture for large-window processors.
2	Architectures and algorithms for millisecond-scale molecular dynamics simulations of proteins.
1	Microarchitecture in the system-level integration era.

2007¶

Cited by	Paper title
449	Optimizing NUCA Organizations and Wiring Alternatives for Large Caches with CACTI 6.0.
420	Stall-Time Fair Memory Access Scheduling for Chip Multiprocessors.
365	Flattened Butterfly Topology for On-Chip Networks.
358	Dynamic Warp Formation and Scheduling for Efficient GPU Control Flow.
247	Multi-bit Error Tolerant Caches Using Two-Dimensional Error Coding.
241	Argus: Low-Cost, Comprehensive Error Detection in Simple Cores.
197	A Practical Approach to Exploiting Coarse-Grained Pipeline Parallelism in C Programs.
192	Composable Lightweight Processors.
184	Penelope: The NBTI-Aware Processor.
179	Revisiting the Sequential Programming Model for Multi-Core.
172	Smart Refresh: An Enhanced Memory Controller Design for Reducing Energy in Conventional and 3D Die-Stacked DRAMs.
157	FPGA-Accelerated Simulation Technologies (FAST): Fast, Full-System, Cycle-Accurate Simulators.
136	Implementing Signatures for Transactional Memory.
127	A Framework for Providing Quality of Service in Chip Multi-Processors.
112	Mitigating Parameter Variation with Dynamic Fine-Grain Body Biasing.
108	Process Variation Tolerant 3T1D-Based Cache Architectures.
107	Software-Based Online Detection of Hardware Defects: Mechanisms, Architectural Support, and Evaluation.
102	Self-calibrating Online Wearout Detection.
99	Leveraging 3D Technology for Improved Reliability.
92	Using Address Independent Seed Encryption and Bonsai Merkle Trees to Make Secure Processors OS- and Performance-Friendly.
75	Microarchitectural Design Space Exploration Using an Architecture-Centric Approach.
62	A Framework for Coarse-Grain Optimizations in the On-Chip Memory Hierarchy.
61	Data Access Partitioning for Fine-grain Parallelism on Multicore Architectures.
59	Scavenger: A New Last Level Cache Architecture with Global Block Priority.
52	Emulating Optimal Replacement with a Shepherd Cache.
38	Uncorq: Unconstrained Snoop Request Delivery in Embedded-Ring Multiprocessors.
36	Impact of Cache Coherence Protocols on the Processing of Network Traffic.
34	Low-Cost Epoch-Based Correlation Prefetching for Commercial Applications.
34	Informed Microarchitecture Design Space Exploration Using Workload Dynamics.
34	Guaranteeing Hits to Improve the Efficiency of a Small Instruction Cache.
28	The Art of Deception: Adaptive Precision Reduction for Area Efficient Physics Acceleration.
24	Effective Optimistic-Checker Tandem Core Design through Architectural Pruning.
20	Global Multi-Threaded Instruction Scheduling.
16	Time Interpolation: So Many Metrics, So Few Registers.
14	Optimal versus Heuristic Global Code Scheduling.

2006¶

Cited by	Paper title
941	Utility-Based Cache Partitioning: A Low-Overhead, High-Performance, Runtime Mechanism to Partition Shared Caches.
603	An Analysis of Efficient Multi-Core Global Power Management Policies: Maximizing Performance for a Given Power Budget.
552	Die Stacking (3D) Microarchitecture.
370	LIFT: A Low-Overhead Practical Information Flow Tracking System for Detecting Security Attacks.
358	Managing Distributed, Shared L2 Caches through OS-Level Page Allocation.
334	Fair Queuing Memory Systems.
302	Leveraging Optical Technology in Future Bus-based Chip Multiprocessors.
299	ViChaR: A Dynamic Virtual Channel Regulator for Network-on-Chip Routers.
282	Live, Runtime Phase Monitoring and Prediction on Real Systems with Application to Dynamic Power Management.
237	ASR: Adaptive Selective Replication for CMP Caches.
209	Architectural Support for Software Transactional Memory.
180	Distributed Microarchitectural Protocols in the TRIPS Prototype Processor.
159	Reunion: Complexity-Effective Multicore Redundancy.
137	In-Network Cache Coherence.
130	A Predictive Performance Model for Superscalar Processors.
128	Mitigating the Impact of Process Variations on Processor Register Files and Execution Units.
110	Yield-Aware Cache Architectures.
92	Adaptive Caches: Effective Shaping of Cache Behavior to Workloads.
90	Memory Prefetching Using Adaptive Stream Detection.
76	Exploiting Fine-Grained Data Parallelism with Chip Multiprocessors and Fast Barriers.
66	NoSQ: Store-Load Communication without a Store Queue.
60	Molecular Caches: A caching structure for dynamic creation of application-specific Heterogeneous cache regions.
59	Coherence Ordering for Ring-based Chip Multiprocessors.
58	Fire-and-Forget: Load/Store Scheduling with No Store Queue at All.
57	Fairness and Throughput in Switch on Event Multithreading.
50	Scalable Cache Miss Handling for High Memory-Level Parallelism.
48	CAPSULE: Hardware-Assisted Parallel Execution of Component-Based Programs.
47	Dynamic Standby Prediction for Leakage Tolerant Microprocessor Functional Units.
44	Phoenix: Detecting and Recovering from Permanent Processor Design Bugs with Programmable Hardware.
36	Dataflow Predication.
36	Support for High-Frequency Streaming in CMPs.
30	PathExpander: Architectural Support for Increasing the Path Coverage of Dynamic Bug Detection.
28	Authentication Control Point and Its Implications For Secure Processor Design.
27	Diverge-Merge Processor (DMP): Dynamic Predicated Execution of Complex Control-Flow Graphs Based on Frequently Executed Paths.
26	Merging Head and Tail Duplication for Convergent Hyperblock Formation.
20	A Floorplan-Aware Dynamic Inductive Noise Controller for Reliable Processor Design.
18	DMDC: Delayed Memory Dependence Checking through Age-Based Filtering.
18	Virtually Pipelined Network Memory.
16	Serialization-Aware Mini-Graphs: Performance with Fewer Resources.
11	Using Branch Correlation to Identify Infeasible Paths for Anomaly Detection.
11	Memory Protection through Dynamic Access Control.
5	Data-Dependency Graph Transformations for Superblock Scheduling.

2005¶

Cited by	Paper title
278	Automatic Thread Extraction with Decoupled Software Pipelining.
211	A Dynamic Compilation Framework for Controlling Microprocessor Energy and Performance.
156	Stream Programming on General-Purpose Processors.
115	A Mechanism for Online Diagnosis of Hard Faults in Microprocessors.
97	Dynamic Helper Threaded Prefetching on the Sun UltraSPARC CMP Processor.
91	The TM3270 Media-Processor.
77	A Quantum Logic Array Microarchitecture: Scalable Quantum Data Movement and Computation.
69	Scalable Store-Load Forwarding via Store Queue Index Prediction.
64	The Cell Processor Architecture.
61	Shader Performance Analysis on a Modern GPU Architecture.
57	Pinot: Speculative Multi-threading Processor Architecture Exploiting Parallelism over a Wide Range of Granularities.
55	Thermal Management of On-Chip Caches Through Power Density Minimization.
52	Wish Branches: Combining Conditional Branching and Predication for Adaptive Predicated Execution.
52	Improving Region Selection in Dynamic Optimization Systems.
49	Address-Indexed Memory Disambiguation and Store-to-Load Forwarding.
48	ReSlice: Selective Re-Execution of Long-Retired Misspeculated Instructions Using Forward Slicing.
45	Continuous Path and Edge Profiling.
42	"Flea-flicker" Multipass Pipelining: An Alternative to the High-Power Out-of-Order Offense.
40	A Criticality Analysis of Clustering in Superscalar Processors.
38	Cherry-MP: Correctly Integrating Checkpointed Early Resource Recycling in Chip Multiprocessors.
35	Address-Value Delta (AVD) Prediction: Increasing the Effectiveness of Runahead Execution by Exploiting Regular Memory Allocation Patterns.
34	µComplexity: Estimating Processor Design Effort.
34	Store Memory-Level Parallelism Optimizations for Commercial Applications.
30	Cost Sensitive Modulo Scheduling in a Loop Accelerator Synthesis System.
29	How to Fake 1000 Registers.
25	Exploiting Vector Parallelism in Software Pipelined Loops.
21	Balancing Resource Utilization to Mitigate Power Density in Processor Pipelines.
14	Reducing Instruction Fetch Cost by Packing Instructions into Register Windows.
10	The Future Evolution of High-Performance Microprocessors.
10	Efficient Use of Invisible Registers in Thumb Code.
6	Incremental Commit Groups for Non-Atomic Trace Processing.
0	Message from the General Chairs.
0	Message from the Program Co-Chairs.