HPCA

All

Cited by	Paper title	Year
1235	Amdahl’s Law in the multicore era.	2008
1022	Evaluating MapReduce for Multi-core and Multiprocessor Systems.	2007
770	LogTM: log-based transactional memory.	2006
617	System level analysis of fast, per-core DVFS using on-chip switching regulators.	2008
616	Unbounded Transactional Memory.	2005
589	Predicting Inter-Thread Cache Contention on a Chip Multi-Processor Architecture.	2005
499	Power Efficient Processor Architecture and The Cell Processor.	2005
395	The Soft Error Problem: An Architectural Perspective.	2005
386	Graphite: A distributed parallel simulator for multicores.	2010
373	LogTM-SE: Decoupling Hardware Transactional Memory from Caches.	2007
348	A novel architecture of the 3D stacked MRAM L2 cache for CMPs.	2009
336	Gaining insights into multicore cache partitioning: Bridging the gap between simulation and real systems.	2008
318	ATLAS: A scalable and high-performance scheduling algorithm for multiple memory controllers.	2010
315	Regional congestion awareness for load balance in networks-on-chip.	2008
266	Dynamic power-performance adaptation of parallel computation on chip multiprocessors.	2006
262	Relaxing non-volatility for fast and energy-efficient STT-RAM caches.	2011
243	Feedback Directed Prefetching: Improving the Performance and Bandwidth-Efficiency of Hardware Prefetchers.	2007
231	CMP network-on-chip overlaid with multi-band RF-interconnect.	2008
229	Improving read performance of Phase Change Memories via Write Cancellation and Write Pausing.	2010
229	A quantitative performance analysis model for GPU architectures.	2011
213	An optimized 3D-stacked memory architecture by exploiting excessive, high-density TSV bandwidth.	2010
211	Cluster-level feedback power control for performance optimization.	2008
206	BigDataBench: A big data benchmark suite from internet services.	2014
198	Chip Multithreading: Opportunities and Challenges.	2005
191	High performance network virtualization with SR-IOV.	2010
190	Concurrent Direct Network Access for Virtual Machine Monitors.	2007
183	Performance, Energy, and Thermal Considerations for SMT and CMP Architectures.	2005
183	Design and evaluation of a hierarchical on-chip interconnect for next-generation CMPs.	2009
177	SafeMem: Exploiting ECC-Memory for Detecting Memory Leaks and Memory Corruption During Production Runs.	2005
177	Dynamically Specialized Datapaths for energy efficient computing.	2011
176	BulletProof: a defect-tolerant CMP switch architecture.	2006
170	Express Cube Topologies for on-Chip Interconnects.	2009
169	CMP design space exploration subject to physical constraints.	2006
166	Essential roles of exploiting internal parallelism of flash memory based solid state drives in high-speed data processing.	2011
164	Construction and use of linear regression models for processor performance analysis.	2006
159	Thermal Herding: Microarchitecture Techniques for Controlling Hotspots in High-Performance 3D-Integrated Processors.	2007
154	Thread block compaction for efficient SIMT control flow.	2011
147	A Scalable, Non-blocking Approach to Transactional Memory.	2007
144	Application-Level Correctness and its Impact on Fault Tolerance.	2007
143	Last level cache (LLC) performance of data mining workloads on a CMP - a case study of parallel bioinformatics workloads.	2006
143	FlexiShare: Channel sharing for an energy-efficient nanophotonic crossbar.	2010
141	An Adaptive Shared/Private NUCA Cache Partitioning Scheme for Chip Multiprocessors.	2007
140	I-CASH: Intelligently Coupled Array of SSD and HDD.	2011
138	FlexiTaint: A programmable accelerator for dynamic taint propagation.	2008
137	HARD: Hardware-Assisted Lockset-based Race Detection.	2007
136	A New Scalable and Cost-Effective Congestion Management Strategy for Lossless Multistage Interconnection Networks.	2005
136	FREE-p: Protecting non-volatile memory against both hard and soft errors.	2011
132	A comprehensive approach to DRAM power management.	2008
131	Operating system support for overlapping-ISA heterogeneous multi-core architectures.	2010
131	CHIPPER: A low-complexity bufferless deflection router.	2011
130	Technology comparison for large last-level caches (L3Cs): Low-leakage SRAM, low write-energy STT-RAM, and refresh-optimized eDRAM.	2013
127	Retention-aware placement in DRAM (RAPID): software methods for quasi-non-volatile DRAM.	2006
125	The common case transactional behavior of multithreaded programs.	2006
125	A Burst Scheduling Access Reordering Mechanism.	2007
123	Adaptive Spill-Receive for robust high-performance caching in CMPs.	2009
120	Application performance modeling in a virtualized environment.	2010
119	Variation-aware dynamic voltage/frequency scaling.	2009
117	Transition Phase Classification and Prediction.	2005
117	Computational sprinting.	2012
115	Scalable architectural support for trusted software.	2010
113	Extending Multicore Architectures to Exploit Hybrid Parallelism in Single-thread Applications.	2007
112	Phase characterization for power: evaluating control-flow-based and event-counter-based techniques.	2006
112	A Hybrid solid-state storage architecture for the performance, energy consumption, and lifetime improvement.	2010
111	Characterizing and Comparing Prevailing Simulation Techniques.	2005
110	Cuckoo directory: A scalable directory for many-core systems.	2011
110	Beyond block I/O: Rethinking traditional storage primitives.	2011
109	Improving write operations in MLC phase change memory.	2012
107	C-Oracle: Predictive thermal management for data centers.	2008
107	Power struggles: Revisiting the RISC vs. CISC debate on contemporary ARM and x86 architectures.	2013
106	Elastic-buffer flow control for on-chip networks.	2009
105	Perturbation-based Fault Screening.	2007
104	Improving Multiple-CMP Systems Using Token Coherence.	2005
103	Uncovering hidden loop level parallelism in sequential applications.	2008
103	Designing a processor from the ground up to allow voltage/reliability tradeoffs.	2010
103	Tiered-latency DRAM: A low latency and low cost DRAM architecture.	2013
102	MemTracker: Efficient and Programmable Support for Memory Access Monitoring and Debugging.	2007
102	Balancing DRAM locality and parallelism in shared memory CMP systems.	2012
97	Illustrative Design Space Studies with Microarchitectural Regression Models.	2007
97	Compute Caches.	2017
95	Eliminating microarchitectural dependency from Architectural Vulnerability.	2009
95	Dynamic hardware-assisted software-controlled page placement to manage capacity allocation and sharing within large caches.	2009
94	Prediction router: Yet another low latency on-chip router architecture.	2009
92	The case for GPGPU spatial multitasking.	2012
91	A Performance Comparison of DRAM Memory System Optimizations for SMT Processors.	2005
91	CORD: cost-effective (and nearly overhead-free) order-recording and data race detection.	2006
91	Interval simulation: Raising the level of abstraction in architectural simulation.	2010
90	Checkpointed Early Load Retirement.	2005
90	CHOP: Adaptive filter-based DRAM caching for CMP server platforms.	2010
89	PageNUCA: Selected policies for page-grain locality management in large shared chip-multiprocessor caches.	2009
89	Accurate microarchitecture-level fault modeling for studying hardware faults.	2009
88	Techniques for bandwidth-efficient prefetching of linked data structures in hybrid prefetching systems.	2009
88	SCD: A scalable coherence directory with flexible sharer set encoding.	2012
88	TAP: A TLP-aware cache management policy for a CPU-GPU heterogeneous architecture.	2012
86	DMA-aware memory energy management.	2006
86	Exploiting parallelism and structure to accelerate the simulation of chip multi-processors.	2006
86	ReViveI/O: efficient handling of I/O in highly-available rollback-recovery servers.	2006
86	A low-radix and low-diameter 3D interconnection network design.	2009
86	CAMP: A technique to estimate per-structure power at run-time using a few simple parameters.	2009
84	A Unified Compressed Memory Hierarchy.	2005
84	Performance and power optimization through data compression in Network-on-Chip architectures.	2008
84	Blueshift: Designing processors for timing speculation from the ground up.	2009
83	Towards scalable, energy-efficient, bus-based on-chip networks.	2010
81	Trends in High-Performance Processors.	2005
81	Understanding how off-chip memory bandwidth partitioning in Chip Multiprocessors affects system performance.	2010
80	MISE: Providing performance predictability and improving fairness in shared main memory systems.	2013
80	Cache coherence for GPU architectures.	2013
79	Fully-Buffered DIMM Memory Architectures: Understanding Mechanisms, Overheads and Scaling.	2007
79	CPU-assisted GPGPU on fused CPU-GPU architectures.	2012
79	High-performance and energy-efficient mobile web browsing on big/little systems.	2013
78	Voltage and Frequency Control With Adaptive Reaction Time in Multiple-Clock-Domain Processors.	2005
77	Understanding the performance-temperature interactions in disk I/O of server workloads.	2006
77	High performance file I/O for the Blue Gene/L supercomputer.	2006
76	A first-order fine-grained multithreaded throughput model.	2009
76	SolarCore: Solar energy driven multi-core architecture power management.	2011
76	ESESC: A fast multicore simulator using Time-Based Sampling.	2013
75	Distributing the Frontend for Temperature Reduction.	2005
75	HAsim: FPGA-based high-detail multicore simulation using time-division multiplexing.	2011
74	DeCoR: A Delayed Commit and Rollback mechanism for handling inductive noise in processors.	2008
73	Bridging the computation gap between programmable processors and hardwired accelerators.	2009
73	Calvin: Deterministic or not? Free will to choose.	2011
73	MRPB: Memory request prioritization for massively parallel processors.	2014
72	Shared last-level TLBs for chip multiprocessors.	2011
72	Runnemede: An architecture for Ubiquitous High-Performance Computing.	2013
72	Accelerating write by exploiting PCM asymmetries.	2013
68	CloudCache: Expanding and shrinking private caches.	2011
68	Improving DRAM performance by parallelizing refreshes with accesses.	2014
67	SENSS: Security Enhancement to Symmetric Shared Memory Multiprocessors.	2005
66	Voltage emergency prediction: Using signatures to reduce operating margins.	2009
65	A Memory-Level Parallelism Aware Fetch Policy for SMT Processors.	2007
65	An OS-based alternative to full hardware coherence on tiled CMPs.	2008
65	Optimizing communication and capacity in a 3D stacked reconfigurable cache hierarchy.	2009
65	In-Network Snoop Ordering (INSO): Snoopy coherence on unordered interconnects.	2009
65	Warped register file: A power efficient register file for GPGPUs.	2013
64	On the Limits of Leakage Power Reduction in Caches.	2005
64	Interactions Between Compression and Prefetching in Chip Multiprocessors.	2007
64	Line Distillation: Increasing Cache Capacity by Filtering Unused Words in Cache Lines.	2007
64	Automated microprocessor stressmark generation.	2008
63	An Adaptive Cache Coherence Protocol Optimized for Producer-Consumer Sharing.	2007
63	Hardware-software integrated approaches to defend against software cache-based side channel attacks.	2009
62	Addressing system-level trimming issues in on-chip nanophotonic networks.	2011
62	A case for guarded power gating for multi-core processors.	2011
61	Stretching the Limits of Clock-Gating Efficiency in Server-Class Processors.	2005
61	A Small, Fast and Low-Power Register File by Bit-Partitioning.	2005
61	Mercury: A fast and energy-efficient multi-level cell based Phase Change Memory system.	2011
60	Practical and secure PCM systems by online detection of malicious write streams.	2011
60	Whole packet forwarding: Efficient design of fully adaptive routing algorithms for networks-on-chip.	2012
59	Effective Instruction Prefetching in Chip Multiprocessors for Modern Commercial Applications.	2005
59	Worth their watts? - an empirical study of datacenter servers.	2010
59	Reducing GPU offload latency via fine-grained CPU-GPU synchronization.	2013
58	Efficient scrub mechanisms for error-prone emerging memories.	2012
57	Modeling and Managing Thermal Profiles of Rack-mounted Servers with ThermoStat.	2007
56	MRR: Enabling fully adaptive multicast routing for CMP interconnection networks.	2009
56	Versatile prediction and fast estimation of Architectural Vulnerability Factor from processor performance metrics.	2009
56	Programming the cloud.	2011
56	Improving GPGPU resource utilization through alternative thread block scheduling.	2014
55	Application-to-core mapping policies to reduce memory system interference in multi-core systems.	2013
54	Enterprise IT Trends and Implications for Architecture Research.	2005
54	Archipelago: A polymorphic cache design for enabling robust near-threshold operation.	2011
54	Booster: Reactive core acceleration for mitigating the effects of process variation and application imbalance in low-voltage chips.	2012
53	Microarchitectural Wire Management for Performance and Power in Partitioned Architectures.	2005
53	An Efficient Programmable 10 Gigabit Ethernet Network Interface Card.	2005
53	Navigating heterogeneous processors with market mechanisms.	2013
52	iCFP: Tolerating all-level cache misses in in-order processors.	2009
52	Overcoming the challenges of crossbar resistive memory architectures.	2015
51	Architecture support for guest-transparent VM protection from untrusted hypervisor and physical attacks.	2013
50	A case for Refresh Pausing in DRAM memory systems.	2013
50	Breaking the on-chip latency barrier using SMART.	2013
50	Adaptive-latency DRAM: Optimizing DRAM timing for the common-case.	2015
49	InfoShield: a security architecture for protecting information usage in memory.	2006
49	Design and implementation of the Blue Gene/P snoop filter.	2008
49	Design and implementation of software-managed caches for multicores with local memory.	2009
49	Energy-efficient interconnect via Router Parking.	2013
49	Power-performance co-optimization of throughput core architecture using resistive memory.	2013
49	Architecture exploration for ambient energy harvesting nonvolatile processors.	2015
48	NUcache: An efficient multicore cache organization based on Next-Use distance.	2011
48	QuickIA: Exploring heterogeneous architectures on real prototypes.	2012
48	SNNAP: Approximate computing on programmable SoCs via neural acceleration.	2015
47	Scatter-Add in Data Parallel Architectures.	2005
47	Thread-safe dynamic binary translation using transactional memory.	2008
47	Dacota: Post-silicon validation of the memory subsystem in multi-core designs.	2009
47	Cooperative partitioning: Energy-efficient cache partitioning for high-performance CMPs.	2012
47	Optimizing virtual machine scheduling in NUMA multicore systems.	2013
47	i2WAP: Improving non-volatile cache lifetime by reducing inter- and intra-set write variations.	2013
46	Adaptive placement and migration policy for an STT-RAM-based hybrid cache.	2014
45	An approach for implementing efficient superscalar CISC processors.	2006
45	A new server I/O architecture for high speed networks.	2011
44	Atomic Coherence: Leveraging nanophotonics to build race-free cache coherence protocols.	2011
44	Fast thread migration via cache working set prediction.	2011
44	Dynamically heterogeneous cores through 3D resource pooling.	2012
44	Enabling distributed generation powered sustainable high-performance data center.	2013
44	A detailed GPU cache model based on reuse distance theory.	2014
43	Heat Stroke: Power-Density-Based Denial of Service in SMT.	2005
43	Coset coding to extend the lifetime of memory.	2013
43	NDA: Near-DRAM acceleration architecture leveraging commodity DRAM devices and standard memory modules.	2015
42	EXCES: External caching in energy saving storage systems.	2008
42	Simple virtual channel allocation for high throughput and high frequency on-chip routers.	2010
42	Design, integration and implementation of the DySER hardware accelerator into OpenSPARC.	2012
41	Colorama: Architectural Support for Data-Centric Synchronization.	2007
41	MorphCache: A Reconfigurable Adaptive Multi-level Cache hierarchy.	2011
41	Quasi-nonvolatile SSD: Trading flash memory nonvolatility to improve storage system performance for enterprise applications.	2012
41	Staged Reads: Mitigating the impact of DRAM writes on DRAM reads.	2012
41	EnergySmart: Toward energy-efficient manycores for Near-Threshold Computing.	2013
41	DASCA: Dead Write Prediction Assisted STT-RAM Cache Architecture.	2014
41	Suppressing the Oblivious RAM timing channel while making information leakage and program efficiency trade-offs.	2014
41	Data retention in MLC NAND flash memory: Characterization, optimization, and recovery.	2015
40	Accelerating and Adapting Precomputation Threads for Efficient Prefetching.	2007
40	Improving cache performance using read-write partitioning.	2014
40	MemZip: Exploring unconventional benefits from memory compression.	2014
40	Heterogeneous memory architectures: A HW/SW approach for mixing die-stacked and off-package memories.	2015
39	Liquid SIMD: Abstracting SIMD Hardware using Lightweight Dynamic Mapping.	2007
39	ESP-NUCA: A low-cost adaptive Non-Uniform Cache Architecture.	2010
39	Dynamic parallelization of JavaScript applications using an ultra-lightweight speculation mechanism.	2011
39	AgileRegulator: A hybrid voltage regulator scheme redeeming dark silicon for power efficiency in a multicore architecture.	2012
39	The dual-path execution model for efficient GPU control flow.	2013
38	A Domain-Specific On-Chip Network Design for Large Scale Cache Systems.	2007
38	Runtime validation of memory ordering using constraint graph checking.	2008
37	Efficient complex operators for irregular codes.	2011
37	System-level implications of disaggregated memory.	2012
37	Timing channel protection for a shared memory controller.	2014
37	Supporting x86-64 address translation for 100s of GPU lanes.	2014
36	Error Detection via Online Checking of Cache Coherence with Token Coherence Signatures.	2007
36	An intelligent IT infrastructure for the future.	2009
36	Optimizing Google’s warehouse scale computers: The NUMA experience.	2013
36	Sonic Millip3De: A massively parallel 3D-stacked accelerator for 3D ultrasound.	2013
35	A Low Overhead Fault Tolerant Coherence Protocol for CMP Architectures.	2007
35	GPGPU performance and power estimation using machine learning.	2015
34	Exploring the Design Space of Power-Aware Opto-Electronic Networked Systems.	2005
34	Supporting highly-decoupled thread-level redundancy for parallel programs.	2008
34	Reconciling specialization and flexibility through compound circuits.	2009
34	Fast complete memory consistency verification.	2009
34	Abstraction and microarchitecture scaling in early-stage power modeling.	2011
34	Disintegrated control for energy-efficient and heterogeneous memory systems.	2013
34	QuickRelease: A throughput-oriented approach to release consistency on GPUs.	2014
34	Event-based scheduling for energy-efficient QoS (eQoS) in mobile Web applications.	2015
33	A decoupled KILO-instruction processor.	2006
33	Characterization of Direct Cache Access on multi-core systems and 10GbE.	2009
33	Power-efficient computing for compute-intensive GPGPU applications.	2013
33	Increasing TLB reach by exploiting clustering in page translations.	2014
32	A bandwidth-aware memory-subsystem resource management using non-invasive resource profilers for large CMP systems.	2010
32	Bloom Filter Guided Transaction Scheduling.	2011
32	Refrint: Intelligent refresh to minimize power in on-chip multiprocessor cache hierarchies.	2013
31	Reducing resource redundancy for concurrent error detection techniques in high performance microprocessors.	2006
31	Fundamental performance constraints in horizontal fusion of in-order cores.	2008
30	UNified Instruction/Translation/Data (UNITD) coherence: One protocol to rule them all.	2010
30	Hardware/software techniques for DRAM thermal management.	2011
30	Achieving uniform performance and maximizing throughput in the presence of heterogeneity.	2011
30	Efficient data streaming with on-chip accelerators: Opportunities and challenges.	2011
30	Adrenaline: Pinpointing and reining in tail queries with quick voltage boosting.	2015
29	DMA cache: Using on-chip storage to architecturally separate I/O data from CPU data for improving I/O performance.	2010
29	SCRAP: Architecture for signature-based protection from Code Reuse Attacks.	2013
29	Mosaic: Exploiting the spatial locality of process variation to reduce refresh energy in on-chip eDRAM modules.	2014
29	Improving system throughput and fairness simultaneously in shared memory CMP systems via Dynamic Bank Partitioning.	2014
29	Sandbox Prefetching: Safe run-time evaluation of aggressive prefetchers.	2014
29	Warp-level divergence in GPUs: Characterization, impact, and mitigation.	2014
28	Multithreaded Value Prediction.	2005
28	Increasing the cache efficiency by eliminating noise.	2006
28	Practical off-chip meta-data for temporal memory streaming.	2009
28	Explaining cache SER anomaly using DUE AVF measurement.	2010
28	MORSE: Multi-objective reconfigurable self-optimizing memory scheduler.	2012
28	Modeling performance variation due to cache sharing.	2013
28	Scaling towards kilo-core processors with asymmetric high-radix topologies.	2013
28	Dynamic management of TurboMode in modern multi-core chips.	2014
28	TSO-CC: Consistency directed cache coherence for TSO.	2014
27	Software Directed Issue Queue Power Reduction.	2005
27	Efficient instruction schedulers for SMT processors.	2006
27	Single-level integrity and confidentiality protection for distributed shared memory multiprocessors.	2008
27	ACCESS: Smart scheduling for asymmetric cache CMPs.	2011
27	Low-voltage on-chip cache architecture using heterogeneous cell sizes for high-performance processors.	2011
26	Low-Overhead Interactive Debugging via Dynamic Instrumentation with DISE.	2005
26	Accurate Energy Dissipation and Thermal Modeling for Nanometer-Scale Buses.	2005
26	High-throughput pairwise point interactions in Anton, a specialized machine for molecular dynamics simulation.	2008
26	Power-Efficient DRAM Speculation.	2008
26	Power shifting in Thrifty Interconnection Network.	2011
26	JETC: Joint energy thermal and cooling management for memory and CPU subsystems in servers.	2012
26	Statistical performance comparisons of computers.	2012
26	Exploiting thermal energy storage to reduce data center capital and operating expenses.	2014
26	Understanding GPU errors on large-scale HPC systems and the implications for system design and operation.	2015
26	Exploiting compressed block size as an indicator of future reuse.	2015
26	Coordinated static and dynamic cache bypassing for GPUs.	2015
26	Bamboo ECC: Strong, safe, and flexible codes for reliable computer memory.	2015
25	Optical Interconnect Opportunities for Future Server Memory Systems.	2007
25	Data-triggered threads: Eliminating redundant computation.	2011
25	MP3: Minimizing performance penalty for power-gating of Clos network-on-chip.	2014
25	Mascar: Speeding up GPU warps by reducing memory pitstops.	2015
25	CATalyst: Defeating last-level cache side channel attacks in cloud computing.	2016
24	Exploiting Postdominance for Speculative Parallelization.	2007
24	Address-branch correlation: A novel locality for long-latency hard-to-predict branches.	2008
24	BOLT: Energy-efficient Out-of-Order Latency-Tolerant execution.	2010
24	π-TM: Pessimistic invalidation for scalable lazy hardware transactional memory.	2012
24	Layout-conscious random topologies for HPC off-chip interconnects.	2013
24	ECM: Effective Capacity Maximizer for high-performance compressed caching.	2013
24	NUAT: A non-uniform access time memory controller.	2014
24	Quantifying sources of error in McPAT and potential impacts on architectural studies.	2015
24	Power punch: Towards non-blocking power-gating of NoC routers.	2015
23	Completely verifying memory consistency of test program executions.	2006
23	Incorporating flexibility in Anton, a specialized machine for molecular dynamics simulation.	2008
23	Architectural Contesting.	2009
23	QORE: A fault tolerant network-on-chip architecture with power-efficient quad-function channel (QFC) buffers.	2014
23	ChargeCache: Reducing DRAM latency by exploiting row access locality.	2016
22	Value Based BTB Indexing for indirect jump prediction.	2010
22	Storage free confidence estimation for the TAGE branch predictor.	2011
22	Power balanced pipelines.	2012
22	Network congestion avoidance through Speculative Reservation.	2012
22	Mobile CPU’s rise to power: Quantifying the impact of generational mobile CPU design trends on performance, energy, and user satisfaction.	2016
21	Software-hardware cooperative memory disambiguation.	2006
21	Improving Branch Prediction and Predicated Execution in Out-of-Order Processors.	2007
21	Runahead Threads to improve SMT performance.	2008
21	Decoupled dynamic cache segmentation.	2012
21	Octopus-Man: QoS-driven task management for heterogeneous multicores in warehouse-scale computers.	2015
20	Store vectors for scalable memory dependence prediction and scheduling.	2006
20	Roughness of microarchitectural design topologies and its implications for optimization.	2008
20	Offline symbolic analysis to infer Total Store Order.	2011
20	MACAU: A Markov model for reliability evaluations of caches under Single-bit and Multi-bit Upsets.	2012
20	Cost effective data center servers.	2013
20	XChange: A market-based approach to scalable dynamic multi-resource allocation in multicore architectures.	2015
19	Feedback mechanisms for improving probabilistic memory prefetching.	2009
19	Soft error vulnerability aware process variation mitigation.	2009
19	IADVS: On-demand performance for interactive applications.	2010
19	HAQu: Hardware-accelerated queueing for fine-grained threading on a chip multiprocessor.	2011
19	Reducing the cost of persistence for nonvolatile heaps in end user devices.	2014
19	Concurrent and consistent virtual machine introspection with hardware transactional memory.	2014
19	CREAM: A Concurrent-Refresh-Aware DRAM Memory architecture.	2014
19	Stash directory: A scalable directory for many-core coherence.	2014
19	Priority-based cache allocation in throughput processors.	2015
19	Prediction-based superpage-friendly TLB designs.	2015
19	Unlocking bandwidth for GPUs in CC-NUMA systems.	2015
19	Low-Cost Inter-Linked Subarrays (LISA): Enabling fast inter-subarray data movement in DRAM.	2016
18	Prediction of CPU idle-busy activity pattern.	2008
18	Checked Load: Architectural support for JavaScript type-checking on mobile processors.	2011
18	WEST: Cloning data cache behavior using Stochastic Traces.	2012
18	Supporting efficient collective communication in NoCs.	2012
18	Pacman: Tolerating asymmetric data races with unintrusive hardware.	2012
18	Improving multi-core performance using mixed-cell cache architecture.	2013
18	Worm-Bubble Flow Control.	2013
18	Sprinkler: Maximizing resource utilization in many-chip solid state disks.	2014
18	PVCoherence: Designing flat coherence protocols for scalable verification.	2014
18	Supporting superpages in non-contiguous physical memory.	2015
18	BRAINIAC: Bringing reliable accuracy into neurally-implemented approximate computing.	2015
17	Probabilistic counter updates for predictor hysteresis and stratification.	2006
17	LiteTM: Reducing transactional state overhead.	2010
17	Locality-aware data replication in the Last-Level Cache.	2014
17	Spare register aware prefetching for graph algorithms on GPUs.	2014
17	Implications of high energy proportional servers on cluster-wide energy proportionality.	2014
17	Practical data value speculation for future high-end processors.	2014
17	Talus: A simple way to remove cliffs in cache performance.	2015
17	Hierarchical private/shared classification: The key to simple and efficient coherence for clustered cache hierarchies.	2015
16	Criticality-based optimizations for efficient load processing.	2009
16	SIF: Overcoming the limitations of SIMD devices via implicit permutation.	2010
16	StimulusCache: Boosting performance of chip multiprocessors with excess cache.	2010
16	Delay-Hiding energy management mechanisms for DRAM.	2010
16	Network within a network approach to create a scalable high-radix router microarchitecture.	2012
16	Tag tables.	2015
15	Parabix: Boosting the efficiency of text processing on commodity processors.	2012
15	Cache restoration for highly partitioned virtualized systems.	2012
15	Exploring high-performance and energy proportional interface for phase change memory systems.	2013
15	Tangle: Route-oriented dynamic voltage minimization for variation-afflicted, energy-efficient on-chip networks.	2014
15	A scalable multi-path microarchitecture for efficient GPU control flow.	2014
15	CAFO: Cost aware flip optimization for asymmetric memories.	2015
15	Malware-aware processors: A framework for efficient online malware detection.	2015
14	Implications of Device Timing Variability on Full Chip Timing.	2007
14	PEEP: Exploiting predictability of memory dependences in SMT processors.	2008
14	Adaptive Reliability Chipkill Correct (ARCC).	2013
14	Precision-aware soft error protection for GPUs.	2014
14	Revolver: Processor architecture for power efficient loop execution.	2014
14	Understanding contention-based channels and using them for defense.	2015
13	Exascale computing: The challenges and opportunities in the next decade.	2010
13	Accelerating business analytics applications.	2012
13	Undersubscribed threading on clustered cache architectures.	2014
13	Domain knowledge based energy management in handhelds.	2015
13	Paying to save: Reducing cost of colocation data center via rewards.	2015
13	Memristive Boltzmann machine: A hardware accelerator for combinatorial optimization and deep learning.	2016
12	Chip-multiprocessing and beyond.	2006
12	PaCo: Probability-based path confidence prediction.	2008
12	Adaptive Set-Granular Cooperative Caching.	2012
12	TS-Router: On maximizing the Quality-of-Allocation in the On-Chip Network.	2013
12	Dynamically detecting and tolerating IF-Condition Data Races.	2014
12	DraMon: Predicting memory bandwidth usage of multi-threaded programs with high accuracy and low overhead.	2014
12	Up by their bootstraps: Online learning in Artificial Neural Networks for CMP uncore power management.	2014
12	Scaling distributed cache hierarchies through computation and data co-scheduling.	2015
11	Tapping ZettaRAM™ for Low-Power Memory Systems.	2005
11	Using Virtual Load/Store Queues (VLSQs) to Reduce the Negative Effects of Reordered Memory Instructions.	2005
11	Skinflint DRAM system: Minimizing DRAM chip writes for low power.	2013
11	Macho: A failure model-oriented adaptive cache architecture to enable near-threshold voltage scaling.	2013
11	Accordion: Toward soft Near-Threshold Voltage Computing.	2014
11	3D stacking of high-performance processors.	2014
11	Augmenting low-latency HPC network with free-space optical links.	2015
11	TABLA: A unified template-based framework for accelerating statistical machine learning.	2016
10	Serializing instructions in system-intensive workloads: Amdahl’s Law strikes again.	2008
10	Speculative instruction validation for performance-reliability trade-off.	2008
10	COMIC++: A software SVM system for heterogeneous multicore accelerator clusters.	2010
10	BulkSMT: Designing SMT processors for atomic-block execution.	2012
10	Illusionist: Transforming lightweight cores into aggressive cores on demand.	2013
10	Store-Load-Branch (SLB) predictor: A compiler assisted branch prediction for data dependent branches.	2013
10	STM: Cloning the spatial and temporal memory access behavior.	2014
10	Strategies for anticipating risk in heterogeneous system design.	2014
10	Overcoming far-end congestion in large-scale networks.	2015
10	Revisiting virtual L1 caches: A practical design using dynamic synonym remapping.	2016
10	Energy-efficient address translation.	2016
9	Exploiting criticality to reduce bottlenecks in distributed uniprocessors.	2011
9	Rainbow: Efficient memory dependence recording with high replay parallelism for relaxed memory model.	2013
9	RECAP: A region-based cure for the common cold (cache).	2013
9	SCOC: High-radix switches made of bufferless clos networks.	2015
9	FTXen: Making hypervisor resilient to hardware faults on relaxed cores.	2015
9	Simultaneous Multikernel GPU: Multi-tasking throughput processors via fine-grained sharing.	2016
9	A performance analysis framework for optimizing OpenCL applications on FPGAs.	2016
9	HRL: Efficient and flexible reconfigurable logic for near-data processing.	2016
8	Performance-aware speculation control using wrong path usefulness prediction.	2008
8	Handling branches in TLS systems with Multi-Path Execution.	2010
8	Hardware/software-based diagnosis of load-store queues using expandable activity logs.	2011
8	Bridging the semantic gap: Emulating biological neuronal behaviors with simple digital neurons.	2013
8	A Non-Inclusive Memory Permissions architecture for protection against cross-layer attacks.	2014
8	Reducing read latency of phase change memory via early read and Turbo Read.	2015
8	Warped-preexecution: A GPU pre-execution approach for improving latency hiding.	2016
8	A case for toggle-aware compression for GPU systems.	2016
7	Architectural support for synchronization-free deterministic parallel programming.	2012
7	A novel system architecture for web scale applications using lightweight CPUs and virtualized I/O.	2013
7	A multiple SIMD, multiple data (MSMD) architecture: Parallel execution of dynamic and static SIMD fragments.	2013
7	Two level bulk preload branch prediction.	2013
7	High-speed formal verification of heterogeneous coherence hierarchies.	2013
7	Understanding the impact of gate-level physical reliability effects on whole program execution.	2014
7	Atomic SC for simple in-order processors.	2014
7	Transportation-network-inspired network-on-chip.	2014
7	FADE: A programmable filtering accelerator for instruction-grain monitoring.	2014
7	Exploring architectural heterogeneity in intelligent vision systems.	2015
7	GPU voltage noise: Characterization and hierarchical smoothing of spatial and temporal voltage noise interference in GPU architectures.	2015
7	BeBoP: A cost effective predictor infrastructure for superscalar value prediction.	2015
7	Understanding idle behavior and power gating mechanisms in the context of modern benchmarks on CPU-GPU Integrated systems.	2015
7	Cache QoS: From concept to reality in the Intel® Xeon® processor E5-2600 v3 product family.	2016
7	A large-scale study of soft-errors on GPUs in the field.	2016
7	Atomic persistence for SCM with a non-intrusive backend controller.	2016
6	High-Performance low-vcc in-order core.	2010
6	Flexible register management using reference counting.	2012
6	In-network traffic regulation for Transactional Memory.	2013
6	iPatch: Intelligent fault patching to improve energy efficiency.	2015
6	Flask coherence: A morphable hybrid coherence protocol to balance energy, performance and scalability.	2015
6	Balancing reliability, cost, and performance tradeoffs with FreeFault.	2015
6	Selective GPU caches to eliminate CPU-GPU HW cache coherence.	2016
5	Speculative synchronization and thread management for fine granularity threads.	2006
5	Fabric convergence implications on systems architecture.	2008
5	HARE: Hardware assisted reverse execution.	2010
5	DMA++: on the fly data realignment for on-chip memories.	2010
5	Fg-STP: Fine-Grain Single Thread Partitioning on Multicores.	2011
5	Architectural framework for supporting operating system survivability.	2011
5	A group-commit mechanism for ROB-based processors implementing the X86 ISA.	2013
5	Over-clocked SSD: Safely running beyond flash memory chip I/O clock specs.	2014
5	CDTT: Compiler-generated data-triggered threads.	2014
5	Scalably verifiable dynamic power management.	2014
5	GPUdmm: A high-performance and memory-oblivious GPU architecture using dynamic memory management.	2014
5	High performing cache hierarchies for server workloads: Relaxing inclusion to capture the latency benefits of exclusive caches.	2015
5	Increasing multicore system efficiency through intelligent bandwidth shifting.	2015
5	Understanding the virtualization "Tax" of scale-out pass-through GPUs in GaaS clouds: An empirical study.	2015
5	CiDRA: A cache-inspired DRAM resilience architecture.	2015
5	Scalable communication architecture for network-attached accelerators.	2015
5	VSR sort: A novel vectorised sorting algorithm & architecture extensions for future microprocessors.	2015
5	Efficient footprint caching for Tagless DRAM Caches.	2016
5	A complete key recovery timing attack on a GPU.	2016
5	McVerSi: A test generation framework for fast memory consistency verification in simulation.	2016
5	Pushing the limits of accelerator efficiency while retaining programmability.	2016
5	Lattice priority scheduling: Low-overhead timing-channel protection for a shared memory controller.	2016
5	Restore truncation for performance improvement in future DRAM systems.	2016
5	Modeling cache performance beyond LRU.	2016
5	SLaC: Stage laser control for a flattened butterfly network.	2016
4	Interconnect-Centric Computing.	2007
4	Branch-mispredict level parallelism (BLP) for control independence.	2008
4	LeadOut: Composing low-overhead frequency-enhancing techniques for single-thread performance in configurable multicores.	2010
4	BulkCompactor: Optimized deterministic execution via Conflict-Aware commit of atomic blocks.	2012
4	Architectural perspectives of future wireless base stations based on the IBM PowerEN™ processor.	2012
4	How to implement effective prediction and forwarding for fusable dynamic multicore architectures.	2013
4	Correction prediction: Reducing error correction latency for on-chip memories.	2015
4	CompEx: Compression-expansion coding for energy, latency, and lifetime improvements in MLC/TLC NVM.	2016
4	ScalCore: Designing a core for voltage scalability.	2016
4	Best-offset hardware prefetching.	2016
4	Predicting the memory bandwidth and optimal core allocations for multi-threaded applications on large-scale NUMA machines.	2016
4	Towards high performance paged memory for GPUs.	2016
4	SoftMC: A Flexible and Practical Open-Source Infrastructure for Enabling Experimental DRAM Studies.	2017
4	Vulnerabilities in MLC NAND Flash Memory Programming: Experimental Analysis, Exploits, and Mitigation Techniques.	2017
3	Petascale Computing Research Challenges - A Manycore Perspective.	2007
3	Lightweight predication support for out of order processors.	2009
3	MOPED: Orchestrating interprocess message data on CMPs.	2011
3	Safe and efficient supervised memory systems.	2011
3	Improving smartphone user experience by balancing performance and energy with probabilistic QoS guarantee.	2016
3	LASER: Light, Accurate Sharing dEtection and Repair.	2016
3	A low power software-defined-radio baseband processor for the Internet of Things.	2016
3	Parity Helix: Efficient protection for single-dimensional faults in multi-dimensional memory systems.	2016
3	Symbiotic job scheduling on the IBM POWER8.	2016
3	MaPU: A novel mathematical computing architecture.	2016
3	Transparent and Efficient CFI Enforcement with Intel Processor Trace.	2017
2	Industrial Perspectives: Platform Design Challenges with Many cores.	2006
2	Opportunities beyond single-core microprocessors.	2009
2	Accelerating decoupled look-ahead via weak dependence removal: A metaheuristic approach.	2014
2	Studying the impact of multicore processor scaling on directory techniques via reuse distance analysis.	2015
2	Alloy: Parallel-serial memory channel architecture for single-chip heterogeneous processor systems.	2015
2	Approximating warps with intra-warp operand value similarity.	2016
2	Software transparent dynamic binary translation for coarse-grain reconfigurable architectures.	2016
2	Amdahl’s law for lifetime reliability scaling in heterogeneous multicore processors.	2016
2	Cost effective physical register sharing.	2016
2	A low-power hybrid reconfigurable architecture for resistive random-access memories.	2016
2	LiveSim: Going live with microarchitecture simulation.	2016
2	Core tunneling: Variation-aware voltage noise mitigation in GPUs.	2016
2	Venice: Exploring server architectures for effective resource sharing.	2016
2	PipeLayer: A Pipelined ReRAM-Based Accelerator for Deep Learning.	2017
1	Architecting for power management: The IBM POWER7™ approach.	2010
1	Hybrid latency tolerance for robust energy-efficiency on 1000-core data parallel processors.	2013
1	Low-overhead and high coverage run-time race detection through selective meta-data management.	2014
1	DVFS for NoCs in CMPs: A thread voting approach.	2016
1	DUANG: Fast and lightweight page migration in asymmetric memory systems.	2016
1	PleaseTM: Enabling transaction conflict management in requester-wins hardware transactional memory.	2016
1	Minimal disturbance placement and promotion.	2016
1	iPAWS: Instruction-issue pattern-based adaptive warp scheduling for GPGPUs.	2016
1	Efficient synthetic traffic models for large, complex SoCs.	2016
1	Efficient GPU hardware transactional memory through early conflict resolution.	2016
1	The runahead network-on-chip.	2016
1	RADAR: Runtime-assisted dead region management for last-level caches.	2016
1	SizeCap: Efficiently handling power surges in fuel cell powered data centers.	2016
1	A market approach for handling power emergencies in multi-tenant data center.	2016
1	Cooper: Task Colocation with Cooperative Games.	2017
1	Secure Dynamic Memory Scheduling Against Timing Channel Attacks.	2017
1	Controlled Kernel Launch for Dynamic Parallelism in GPUs.	2017
1	Exploring Hyperdimensional Associative Memory.	2017
1	SILC-FM: Subblocked InterLeaved Cache-Like Flat Memory Organization.	2017
1	ATOM: Atomic Durability in Non-volatile Memory through Hardware Logging.	2017
1	MemPod: A Clustered Architecture for Efficient and Scalable Migration in Flat Address Space Multi-level Memories.	2017
1	Needle: Leveraging Program Analysis to Analyze and Extract Accelerators from Whole Programs.	2017
1	Dynamic GPGPU Power Management Using Adaptive Model Predictive Control.	2017
1	SWAP: Effective Fine-Grain Management of Shared Last-Level Caches with Minimum Hardware Support.	2017
1	GraphPIM: Enabling Instruction-Level PIM Offloading in Graph Computing Frameworks.	2017
1	Near-Ideal Networks-on-Chip for Servers.	2017
0	The Future of Computer Architecture Research: An Industrial Perspective.	2005
0	Industrial Perspectives: The Next Roadblocks in SOC Evolution: On-Chip Storage Capacity and Off-Chip Bandwidth.	2006
0	Industrial Perspectives: System IO Network Evolution - Closing Requirement Gaps.	2006
0	New architectures for a new biology.	2006
0	Intel’s Tera-scale Computing Project: The first five years, the next five years.	2008
0	Compilers and parallel computing systems.	2008
0	Industrial perspectives panel.	2009
0	Multi-core demands multi-interfaces.	2009
0	Is hardware innovation over?	2010
0	Extreme scale computing: Challenges and opportunities.	2010
0	How’s the parallel computing revolution going?	2011
0	Improving in-memory database index performance with Intel® Transactional Synchronization Extensions.	2014
0	Run-time monitoring with adjustable overhead using dataflow-guided filtering.	2015
0	Design and implementation of a mobile storage leveraging the DRAM interface.	2016
0	SCsafe: Logging sequential consistency violations continuously and precisely.	2016
0	PABST: Proportionally Allocated Bandwidth at the Source and Target.	2017
0	Near-Optimal Access Partitioning for Memory Hierarchies with Multiple Heterogeneous Bandwidth Sources.	2017
0	BRAVO: Balanced Reliability-Aware Voltage Optimization.	2017
0	Hipster: Hybrid Task Manager for Latency-Critical Cloud Workloads.	2017
0	Designing Low-Power, Low-Latency Networks-on-Chip by Optimally Combining Electrical and Optical Links.	2017
0	Design and Analysis of an APU for Exascale Computing.	2017
0	Boomerang: A Metadata-Free Architecture for Control Flow Delivery.	2017
0	Partial Row Activation for Low-Power DRAM System.	2017
0	High-Bandwidth Low-Latency Approximate Interconnection Networks.	2017
0	Efficient Sequential Consistency in GPUs via Relativistic Cache Coherence.	2017
0	Static Bubble: A Framework for Deadlock-Free Irregular On-chip Topologies.	2017
0	Cooperative Path-ORAM for Effective Memory Bandwidth Sharing in Server Settings.	2017
0	Camouflage: Memory Traffic Shaping to Mitigate Timing Attacks.	2017
0	Cold Boot Attacks are Still Hot: Security Analysis of Memory Scramblers in Modern Processors.	2017
0	Balancing Performance and Lifetime of MLC PCM by Using a Region Retention Monitor.	2017
0	Architecting an Energy-Efficient DRAM System for GPUs.	2017
0	Processing-in-Memory Enabled Graphics Processors for 3D Rendering.	2017
0	Design and Evaluation of AWGR-Based Photonic NoC Architectures for 2.5D Integrated High Performance Computing Systems.	2017
0	Defect Analysis and Cost-Effective Resilience Architecture for Future DRAM Devices.	2017
0	Random Folded Clos Topologies for Datacenter Networks.	2017
0	Radiation-Induced Error Criticality in Modern HPC Parallel Accelerators.	2017
0	Enabling Effective Module-Oblivious Power Gating for Embedded Processors.	2017
0	Fast Decentralized Power Capping for Server Clusters.	2017
0	Maximizing Cache Performance Under Uncertainty.	2017
0	Towards Pervasive and User Satisfactory CNN across GPU Microarchitectures.	2017
0	Supporting Address Translation for Accelerator-Centric Architectures.	2017
0	G-Scalar: Cost-Effective Generalized Scalar Execution Architecture for Power-Efficient GPUs.	2017
0	NCAP: Network-Driven, Packet Context-Aware Power Management for Client-Server Architecture.	2017
0	Fast and Accurate Exploration of Multi-level Caches Using Hierarchical Reuse Distance.	2017
0	Application-Specific Performance-Aware Energy Optimization on Android Mobile Devices.	2017
0	Pilot Register File: Energy Efficient Partitioned Register File for GPUs.	2017
0	FlexFlow: A Flexible Dataflow Accelerator Architecture for Convolutional Neural Networks.	2017
0	Reliability-Aware Scheduling on Heterogeneous Multicore Processors.	2017
0	KAML: A Flexible, High-Performance Key-Value SSD.	2017
0	A Split Cache Hierarchy for Enabling Data-Oriented Optimizations.	2017
0	Understanding and Optimizing Power Consumption in Memory Networks.	2017
0	SOUP-N-SALAD: Allocation-Oblivious Access Latency Reduction with Asymmetric DRAM Microarchitectures.	2017
0	Tiny Directory: Efficient Shared Memory in Many-Core Systems with Ultra-Low-Overhead Coherence Tracking.	2017

2017¶

Cited by	Paper title
97	Compute Caches.
4	SoftMC: A Flexible and Practical Open-Source Infrastructure for Enabling Experimental DRAM Studies.
4	Vulnerabilities in MLC NAND Flash Memory Programming: Experimental Analysis, Exploits, and Mitigation Techniques.
3	Transparent and Efficient CFI Enforcement with Intel Processor Trace.
2	PipeLayer: A Pipelined ReRAM-Based Accelerator for Deep Learning.
1	Cooper: Task Colocation with Cooperative Games.
1	Secure Dynamic Memory Scheduling Against Timing Channel Attacks.
1	Controlled Kernel Launch for Dynamic Parallelism in GPUs.
1	Exploring Hyperdimensional Associative Memory.
1	SILC-FM: Subblocked InterLeaved Cache-Like Flat Memory Organization.
1	ATOM: Atomic Durability in Non-volatile Memory through Hardware Logging.
1	MemPod: A Clustered Architecture for Efficient and Scalable Migration in Flat Address Space Multi-level Memories.
1	Needle: Leveraging Program Analysis to Analyze and Extract Accelerators from Whole Programs.
1	Dynamic GPGPU Power Management Using Adaptive Model Predictive Control.
1	SWAP: Effective Fine-Grain Management of Shared Last-Level Caches with Minimum Hardware Support.
1	GraphPIM: Enabling Instruction-Level PIM Offloading in Graph Computing Frameworks.
1	Near-Ideal Networks-on-Chip for Servers.
0	PABST: Proportionally Allocated Bandwidth at the Source and Target.
0	Near-Optimal Access Partitioning for Memory Hierarchies with Multiple Heterogeneous Bandwidth Sources.
0	BRAVO: Balanced Reliability-Aware Voltage Optimization.
0	Hipster: Hybrid Task Manager for Latency-Critical Cloud Workloads.
0	Designing Low-Power, Low-Latency Networks-on-Chip by Optimally Combining Electrical and Optical Links.
0	Design and Analysis of an APU for Exascale Computing.
0	Boomerang: A Metadata-Free Architecture for Control Flow Delivery.
0	Partial Row Activation for Low-Power DRAM System.
0	High-Bandwidth Low-Latency Approximate Interconnection Networks.
0	Efficient Sequential Consistency in GPUs via Relativistic Cache Coherence.
0	Static Bubble: A Framework for Deadlock-Free Irregular On-chip Topologies.
0	Cooperative Path-ORAM for Effective Memory Bandwidth Sharing in Server Settings.
0	Camouflage: Memory Traffic Shaping to Mitigate Timing Attacks.
0	Cold Boot Attacks are Still Hot: Security Analysis of Memory Scramblers in Modern Processors.
0	Balancing Performance and Lifetime of MLC PCM by Using a Region Retention Monitor.
0	Architecting an Energy-Efficient DRAM System for GPUs.
0	Processing-in-Memory Enabled Graphics Processors for 3D Rendering.
0	Design and Evaluation of AWGR-Based Photonic NoC Architectures for 2.5D Integrated High Performance Computing Systems.
0	Defect Analysis and Cost-Effective Resilience Architecture for Future DRAM Devices.
0	Random Folded Clos Topologies for Datacenter Networks.
0	Radiation-Induced Error Criticality in Modern HPC Parallel Accelerators.
0	Enabling Effective Module-Oblivious Power Gating for Embedded Processors.
0	Fast Decentralized Power Capping for Server Clusters.
0	Maximizing Cache Performance Under Uncertainty.
0	Towards Pervasive and User Satisfactory CNN across GPU Microarchitectures.
0	Supporting Address Translation for Accelerator-Centric Architectures.
0	G-Scalar: Cost-Effective Generalized Scalar Execution Architecture for Power-Efficient GPUs.
0	NCAP: Network-Driven, Packet Context-Aware Power Management for Client-Server Architecture.
0	Fast and Accurate Exploration of Multi-level Caches Using Hierarchical Reuse Distance.
0	Application-Specific Performance-Aware Energy Optimization on Android Mobile Devices.
0	Pilot Register File: Energy Efficient Partitioned Register File for GPUs.
0	FlexFlow: A Flexible Dataflow Accelerator Architecture for Convolutional Neural Networks.
0	Reliability-Aware Scheduling on Heterogeneous Multicore Processors.
0	KAML: A Flexible, High-Performance Key-Value SSD.
0	A Split Cache Hierarchy for Enabling Data-Oriented Optimizations.
0	Understanding and Optimizing Power Consumption in Memory Networks.
0	SOUP-N-SALAD: Allocation-Oblivious Access Latency Reduction with Asymmetric DRAM Microarchitectures.
0	Tiny Directory: Efficient Shared Memory in Many-Core Systems with Ultra-Low-Overhead Coherence Tracking.

2016¶

Cited by	Paper title
25	CATalyst: Defeating last-level cache side channel attacks in cloud computing.
23	ChargeCache: Reducing DRAM latency by exploiting row access locality.
22	Mobile CPU’s rise to power: Quantifying the impact of generational mobile CPU design trends on performance, energy, and user satisfaction.
19	Low-Cost Inter-Linked Subarrays (LISA): Enabling fast inter-subarray data movement in DRAM.
13	Memristive Boltzmann machine: A hardware accelerator for combinatorial optimization and deep learning.
11	TABLA: A unified template-based framework for accelerating statistical machine learning.
10	Revisiting virtual L1 caches: A practical design using dynamic synonym remapping.
10	Energy-efficient address translation.
9	Simultaneous Multikernel GPU: Multi-tasking throughput processors via fine-grained sharing.
9	A performance analysis framework for optimizing OpenCL applications on FPGAs.
9	HRL: Efficient and flexible reconfigurable logic for near-data processing.
8	Warped-preexecution: A GPU pre-execution approach for improving latency hiding.
8	A case for toggle-aware compression for GPU systems.
7	Cache QoS: From concept to reality in the Intel® Xeon® processor E5-2600 v3 product family.
7	A large-scale study of soft-errors on GPUs in the field.
7	Atomic persistence for SCM with a non-intrusive backend controller.
6	Selective GPU caches to eliminate CPU-GPU HW cache coherence.
5	Efficient footprint caching for Tagless DRAM Caches.
5	A complete key recovery timing attack on a GPU.
5	McVerSi: A test generation framework for fast memory consistency verification in simulation.
5	Pushing the limits of accelerator efficiency while retaining programmability.
5	Lattice priority scheduling: Low-overhead timing-channel protection for a shared memory controller.
5	Restore truncation for performance improvement in future DRAM systems.
5	Modeling cache performance beyond LRU.
5	SLaC: Stage laser control for a flattened butterfly network.
4	CompEx: Compression-expansion coding for energy, latency, and lifetime improvements in MLC/TLC NVM.
4	ScalCore: Designing a core for voltage scalability.
4	Best-offset hardware prefetching.
4	Predicting the memory bandwidth and optimal core allocations for multi-threaded applications on large-scale NUMA machines.
4	Towards high performance paged memory for GPUs.
3	Improving smartphone user experience by balancing performance and energy with probabilistic QoS guarantee.
3	LASER: Light, Accurate Sharing dEtection and Repair.
3	A low power software-defined-radio baseband processor for the Internet of Things.
3	Parity Helix: Efficient protection for single-dimensional faults in multi-dimensional memory systems.
3	Symbiotic job scheduling on the IBM POWER8.
3	MaPU: A novel mathematical computing architecture.
2	Approximating warps with intra-warp operand value similarity.
2	Software transparent dynamic binary translation for coarse-grain reconfigurable architectures.
2	Amdahl’s law for lifetime reliability scaling in heterogeneous multicore processors.
2	Cost effective physical register sharing.
2	A low-power hybrid reconfigurable architecture for resistive random-access memories.
2	LiveSim: Going live with microarchitecture simulation.
2	Core tunneling: Variation-aware voltage noise mitigation in GPUs.
2	Venice: Exploring server architectures for effective resource sharing.
1	DVFS for NoCs in CMPs: A thread voting approach.
1	DUANG: Fast and lightweight page migration in asymmetric memory systems.
1	PleaseTM: Enabling transaction conflict management in requester-wins hardware transactional memory.
1	Minimal disturbance placement and promotion.
1	iPAWS: Instruction-issue pattern-based adaptive warp scheduling for GPGPUs.
1	Efficient synthetic traffic models for large, complex SoCs.
1	Efficient GPU hardware transactional memory through early conflict resolution.
1	The runahead network-on-chip.
1	RADAR: Runtime-assisted dead region management for last-level caches.
1	SizeCap: Efficiently handling power surges in fuel cell powered data centers.
1	A market approach for handling power emergencies in multi-tenant data center.
0	Design and implementation of a mobile storage leveraging the DRAM interface.
0	SCsafe: Logging sequential consistency violations continuously and precisely.

2015¶

Cited by	Paper title
52	Overcoming the challenges of crossbar resistive memory architectures.
50	Adaptive-latency DRAM: Optimizing DRAM timing for the common-case.
49	Architecture exploration for ambient energy harvesting nonvolatile processors.
48	SNNAP: Approximate computing on programmable SoCs via neural acceleration.
43	NDA: Near-DRAM acceleration architecture leveraging commodity DRAM devices and standard memory modules.
41	Data retention in MLC NAND flash memory: Characterization, optimization, and recovery.
40	Heterogeneous memory architectures: A HW/SW approach for mixing die-stacked and off-package memories.
35	GPGPU performance and power estimation using machine learning.
34	Event-based scheduling for energy-efficient QoS (eQoS) in mobile Web applications.
30	Adrenaline: Pinpointing and reining in tail queries with quick voltage boosting.
26	Understanding GPU errors on large-scale HPC systems and the implications for system design and operation.
26	Exploiting compressed block size as an indicator of future reuse.
26	Coordinated static and dynamic cache bypassing for GPUs.
26	Bamboo ECC: Strong, safe, and flexible codes for reliable computer memory.
25	Mascar: Speeding up GPU warps by reducing memory pitstops.
24	Quantifying sources of error in McPAT and potential impacts on architectural studies.
24	Power punch: Towards non-blocking power-gating of NoC routers.
21	Octopus-Man: QoS-driven task management for heterogeneous multicores in warehouse-scale computers.
20	XChange: A market-based approach to scalable dynamic multi-resource allocation in multicore architectures.
19	Priority-based cache allocation in throughput processors.
19	Prediction-based superpage-friendly TLB designs.
19	Unlocking bandwidth for GPUs in CC-NUMA systems.
18	Supporting superpages in non-contiguous physical memory.
18	BRAINIAC: Bringing reliable accuracy into neurally-implemented approximate computing.
17	Talus: A simple way to remove cliffs in cache performance.
17	Hierarchical private/shared classification: The key to simple and efficient coherence for clustered cache hierarchies.
16	Tag tables.
15	CAFO: Cost aware flip optimization for asymmetric memories.
15	Malware-aware processors: A framework for efficient online malware detection.
14	Understanding contention-based channels and using them for defense.
13	Domain knowledge based energy management in handhelds.
13	Paying to save: Reducing cost of colocation data center via rewards.
12	Scaling distributed cache hierarchies through computation and data co-scheduling.
11	Augmenting low-latency HPC network with free-space optical links.
10	Overcoming far-end congestion in large-scale networks.
9	SCOC: High-radix switches made of bufferless clos networks.
9	FTXen: Making hypervisor resilient to hardware faults on relaxed cores.
8	Reducing read latency of phase change memory via early read and Turbo Read.
7	Exploring architectural heterogeneity in intelligent vision systems.
7	GPU voltage noise: Characterization and hierarchical smoothing of spatial and temporal voltage noise interference in GPU architectures.
7	BeBoP: A cost effective predictor infrastructure for superscalar value prediction.
7	Understanding idle behavior and power gating mechanisms in the context of modern benchmarks on CPU-GPU Integrated systems.
6	iPatch: Intelligent fault patching to improve energy efficiency.
6	Flask coherence: A morphable hybrid coherence protocol to balance energy, performance and scalability.
6	Balancing reliability, cost, and performance tradeoffs with FreeFault.
5	High performing cache hierarchies for server workloads: Relaxing inclusion to capture the latency benefits of exclusive caches.
5	Increasing multicore system efficiency through intelligent bandwidth shifting.
5	Understanding the virtualization "Tax" of scale-out pass-through GPUs in GaaS clouds: An empirical study.
5	CiDRA: A cache-inspired DRAM resilience architecture.
5	Scalable communication architecture for network-attached accelerators.
5	VSR sort: A novel vectorised sorting algorithm & architecture extensions for future microprocessors.
4	Correction prediction: Reducing error correction latency for on-chip memories.
2	Studying the impact of multicore processor scaling on directory techniques via reuse distance analysis.
2	Alloy: Parallel-serial memory channel architecture for single-chip heterogeneous processor systems.
0	Run-time monitoring with adjustable overhead using dataflow-guided filtering.

2014¶

Cited by	Paper title
206	BigDataBench: A big data benchmark suite from internet services.
73	MRPB: Memory request prioritization for massively parallel processors.
68	Improving DRAM performance by parallelizing refreshes with accesses.
56	Improving GPGPU resource utilization through alternative thread block scheduling.
46	Adaptive placement and migration policy for an STT-RAM-based hybrid cache.
44	A detailed GPU cache model based on reuse distance theory.
41	DASCA: Dead Write Prediction Assisted STT-RAM Cache Architecture.
41	Suppressing the Oblivious RAM timing channel while making information leakage and program efficiency trade-offs.
40	Improving cache performance using read-write partitioning.
40	MemZip: Exploring unconventional benefits from memory compression.
37	Timing channel protection for a shared memory controller.
37	Supporting x86-64 address translation for 100s of GPU lanes.
34	QuickRelease: A throughput-oriented approach to release consistency on GPUs.
33	Increasing TLB reach by exploiting clustering in page translations.
29	Mosaic: Exploiting the spatial locality of process variation to reduce refresh energy in on-chip eDRAM modules.
29	Improving system throughput and fairness simultaneously in shared memory CMP systems via Dynamic Bank Partitioning.
29	Sandbox Prefetching: Safe run-time evaluation of aggressive prefetchers.
29	Warp-level divergence in GPUs: Characterization, impact, and mitigation.
28	Dynamic management of TurboMode in modern multi-core chips.
28	TSO-CC: Consistency directed cache coherence for TSO.
26	Exploiting thermal energy storage to reduce data center capital and operating expenses.
25	MP3: Minimizing performance penalty for power-gating of Clos network-on-chip.
24	NUAT: A non-uniform access time memory controller.
23	QORE: A fault tolerant network-on-chip architecture with power-efficient quad-function channel (QFC) buffers.
19	Reducing the cost of persistence for nonvolatile heaps in end user devices.
19	Concurrent and consistent virtual machine introspection with hardware transactional memory.
19	CREAM: A Concurrent-Refresh-Aware DRAM Memory architecture.
19	Stash directory: A scalable directory for many-core coherence.
18	Sprinkler: Maximizing resource utilization in many-chip solid state disks.
18	PVCoherence: Designing flat coherence protocols for scalable verification.
17	Locality-aware data replication in the Last-Level Cache.
17	Spare register aware prefetching for graph algorithms on GPUs.
17	Implications of high energy proportional servers on cluster-wide energy proportionality.
17	Practical data value speculation for future high-end processors.
15	Tangle: Route-oriented dynamic voltage minimization for variation-afflicted, energy-efficient on-chip networks.
15	A scalable multi-path microarchitecture for efficient GPU control flow.
14	Precision-aware soft error protection for GPUs.
14	Revolver: Processor architecture for power efficient loop execution.
13	Undersubscribed threading on clustered cache architectures.
12	Dynamically detecting and tolerating IF-Condition Data Races.
12	DraMon: Predicting memory bandwidth usage of multi-threaded programs with high accuracy and low overhead.
12	Up by their bootstraps: Online learning in Artificial Neural Networks for CMP uncore power management.
11	Accordion: Toward soft Near-Threshold Voltage Computing.
11	3D stacking of high-performance processors.
10	STM: Cloning the spatial and temporal memory access behavior.
10	Strategies for anticipating risk in heterogeneous system design.
8	A Non-Inclusive Memory Permissions architecture for protection against cross-layer attacks.
7	Understanding the impact of gate-level physical reliability effects on whole program execution.
7	Atomic SC for simple in-order processors.
7	Transportation-network-inspired network-on-chip.
7	FADE: A programmable filtering accelerator for instruction-grain monitoring.
5	Over-clocked SSD: Safely running beyond flash memory chip I/O clock specs.
5	CDTT: Compiler-generated data-triggered threads.
5	Scalably verifiable dynamic power management.
5	GPUdmm: A high-performance and memory-oblivious GPU architecture using dynamic memory management.
2	Accelerating decoupled look-ahead via weak dependence removal: A metaheuristic approach.
1	Low-overhead and high coverage run-time race detection through selective meta-data management.
0	Improving in-memory database index performance with Intel® Transactional Synchronization Extensions.

2013¶

Cited by	Paper title
130	Technology comparison for large last-level caches (L3Cs): Low-leakage SRAM, low write-energy STT-RAM, and refresh-optimized eDRAM.
107	Power struggles: Revisiting the RISC vs. CISC debate on contemporary ARM and x86 architectures.
103	Tiered-latency DRAM: A low latency and low cost DRAM architecture.
80	MISE: Providing performance predictability and improving fairness in shared main memory systems.
80	Cache coherence for GPU architectures.
79	High-performance and energy-efficient mobile web browsing on big/little systems.
76	ESESC: A fast multicore simulator using Time-Based Sampling.
72	Runnemede: An architecture for Ubiquitous High-Performance Computing.
72	Accelerating write by exploiting PCM asymmetries.
65	Warped register file: A power efficient register file for GPGPUs.
59	Reducing GPU offload latency via fine-grained CPU-GPU synchronization.
55	Application-to-core mapping policies to reduce memory system interference in multi-core systems.
53	Navigating heterogeneous processors with market mechanisms.
51	Architecture support for guest-transparent VM protection from untrusted hypervisor and physical attacks.
50	A case for Refresh Pausing in DRAM memory systems.
50	Breaking the on-chip latency barrier using SMART.
49	Energy-efficient interconnect via Router Parking.
49	Power-performance co-optimization of throughput core architecture using resistive memory.
47	Optimizing virtual machine scheduling in NUMA multicore systems.
47	i2WAP: Improving non-volatile cache lifetime by reducing inter- and intra-set write variations.
44	Enabling distributed generation powered sustainable high-performance data center.
43	Coset coding to extend the lifetime of memory.
41	EnergySmart: Toward energy-efficient manycores for Near-Threshold Computing.
39	The dual-path execution model for efficient GPU control flow.
36	Optimizing Google’s warehouse scale computers: The NUMA experience.
36	Sonic Millip3De: A massively parallel 3D-stacked accelerator for 3D ultrasound.
34	Disintegrated control for energy-efficient and heterogeneous memory systems.
33	Power-efficient computing for compute-intensive GPGPU applications.
32	Refrint: Intelligent refresh to minimize power in on-chip multiprocessor cache hierarchies.
29	SCRAP: Architecture for signature-based protection from Code Reuse Attacks.
28	Modeling performance variation due to cache sharing.
28	Scaling towards kilo-core processors with asymmetric high-radix topologies.
24	Layout-conscious random topologies for HPC off-chip interconnects.
24	ECM: Effective Capacity Maximizer for high-performance compressed caching.
20	Cost effective data center servers.
18	Improving multi-core performance using mixed-cell cache architecture.
18	Worm-Bubble Flow Control.
15	Exploring high-performance and energy proportional interface for phase change memory systems.
14	Adaptive Reliability Chipkill Correct (ARCC).
12	TS-Router: On maximizing the Quality-of-Allocation in the On-Chip Network.
11	Skinflint DRAM system: Minimizing DRAM chip writes for low power.
11	Macho: A failure model-oriented adaptive cache architecture to enable near-threshold voltage scaling.
10	Illusionist: Transforming lightweight cores into aggressive cores on demand.
10	Store-Load-Branch (SLB) predictor: A compiler assisted branch prediction for data dependent branches.
9	Rainbow: Efficient memory dependence recording with high replay parallelism for relaxed memory model.
9	RECAP: A region-based cure for the common cold (cache).
8	Bridging the semantic gap: Emulating biological neuronal behaviors with simple digital neurons.
7	A novel system architecture for web scale applications using lightweight CPUs and virtualized I/O.
7	A multiple SIMD, multiple data (MSMD) architecture: Parallel execution of dynamic and static SIMD fragments.
7	Two level bulk preload branch prediction.
7	High-speed formal verification of heterogeneous coherence hierarchies.
6	In-network traffic regulation for Transactional Memory.
5	A group-commit mechanism for ROB-based processors implementing the X86 ISA.
4	How to implement effective prediction and forwarding for fusable dynamic multicore architectures.
1	Hybrid latency tolerance for robust energy-efficiency on 1000-core data parallel processors.

2012¶

Cited by	Paper title
117	Computational sprinting.
109	Improving write operations in MLC phase change memory.
102	Balancing DRAM locality and parallelism in shared memory CMP systems.
92	The case for GPGPU spatial multitasking.
88	SCD: A scalable coherence directory with flexible sharer set encoding.
88	TAP: A TLP-aware cache management policy for a CPU-GPU heterogeneous architecture.
79	CPU-assisted GPGPU on fused CPU-GPU architectures.
60	Whole packet forwarding: Efficient design of fully adaptive routing algorithms for networks-on-chip.
58	Efficient scrub mechanisms for error-prone emerging memories.
54	Booster: Reactive core acceleration for mitigating the effects of process variation and application imbalance in low-voltage chips.
48	QuickIA: Exploring heterogeneous architectures on real prototypes.
47	Cooperative partitioning: Energy-efficient cache partitioning for high-performance CMPs.
44	Dynamically heterogeneous cores through 3D resource pooling.
42	Design, integration and implementation of the DySER hardware accelerator into OpenSPARC.
41	Quasi-nonvolatile SSD: Trading flash memory nonvolatility to improve storage system performance for enterprise applications.
41	Staged Reads: Mitigating the impact of DRAM writes on DRAM reads.
39	AgileRegulator: A hybrid voltage regulator scheme redeeming dark silicon for power efficiency in a multicore architecture.
37	System-level implications of disaggregated memory.
28	MORSE: Multi-objective reconfigurable self-optimizing memory scheduler.
26	JETC: Joint energy thermal and cooling management for memory and CPU subsystems in servers.
26	Statistical performance comparisons of computers.
24	π-TM: Pessimistic invalidation for scalable lazy hardware transactional memory.
22	Power balanced pipelines.
22	Network congestion avoidance through Speculative Reservation.
21	Decoupled dynamic cache segmentation.
20	MACAU: A Markov model for reliability evaluations of caches under Single-bit and Multi-bit Upsets.
18	WEST: Cloning data cache behavior using Stochastic Traces.
18	Supporting efficient collective communication in NoCs.
18	Pacman: Tolerating asymmetric data races with unintrusive hardware.
16	Network within a network approach to create a scalable high-radix router microarchitecture.
15	Parabix: Boosting the efficiency of text processing on commodity processors.
15	Cache restoration for highly partitioned virtualized systems.
13	Accelerating business analytics applications.
12	Adaptive Set-Granular Cooperative Caching.
10	BulkSMT: Designing SMT processors for atomic-block execution.
7	Architectural support for synchronization-free deterministic parallel programming.
6	Flexible register management using reference counting.
4	BulkCompactor: Optimized deterministic execution via Conflict-Aware commit of atomic blocks.
4	Architectural perspectives of future wireless base stations based on the IBM PowerEN™ processor.

2011¶

Cited by	Paper title
262	Relaxing non-volatility for fast and energy-efficient STT-RAM caches.
229	A quantitative performance analysis model for GPU architectures.
177	Dynamically Specialized Datapaths for energy efficient computing.
166	Essential roles of exploiting internal parallelism of flash memory based solid state drives in high-speed data processing.
154	Thread block compaction for efficient SIMT control flow.
140	I-CASH: Intelligently Coupled Array of SSD and HDD.
136	FREE-p: Protecting non-volatile memory against both hard and soft errors.
131	CHIPPER: A low-complexity bufferless deflection router.
110	Cuckoo directory: A scalable directory for many-core systems.
110	Beyond block I/O: Rethinking traditional storage primitives.
76	SolarCore: Solar energy driven multi-core architecture power management.
75	HAsim: FPGA-based high-detail multicore simulation using time-division multiplexing.
73	Calvin: Deterministic or not? Free will to choose.
72	Shared last-level TLBs for chip multiprocessors.
68	CloudCache: Expanding and shrinking private caches.
62	Addressing system-level trimming issues in on-chip nanophotonic networks.
62	A case for guarded power gating for multi-core processors.
61	Mercury: A fast and energy-efficient multi-level cell based Phase Change Memory system.
60	Practical and secure PCM systems by online detection of malicious write streams.
56	Programming the cloud.
54	Archipelago: A polymorphic cache design for enabling robust near-threshold operation.
48	NUcache: An efficient multicore cache organization based on Next-Use distance.
45	A new server I/O architecture for high speed networks.
44	Atomic Coherence: Leveraging nanophotonics to build race-free cache coherence protocols.
44	Fast thread migration via cache working set prediction.
41	MorphCache: A Reconfigurable Adaptive Multi-level Cache hierarchy.
39	Dynamic parallelization of JavaScript applications using an ultra-lightweight speculation mechanism.
37	Efficient complex operators for irregular codes.
34	Abstraction and microarchitecture scaling in early-stage power modeling.
32	Bloom Filter Guided Transaction Scheduling.
30	Hardware/software techniques for DRAM thermal management.
30	Achieving uniform performance and maximizing throughput in the presence of heterogeneity.
30	Efficient data streaming with on-chip accelerators: Opportunities and challenges.
27	ACCESS: Smart scheduling for asymmetric cache CMPs.
27	Low-voltage on-chip cache architecture using heterogeneous cell sizes for high-performance processors.
26	Power shifting in Thrifty Interconnection Network.
25	Data-triggered threads: Eliminating redundant computation.
22	Storage free confidence estimation for the TAGE branch predictor.
20	Offline symbolic analysis to infer Total Store Order.
19	HAQu: Hardware-accelerated queueing for fine-grained threading on a chip multiprocessor.
18	Checked Load: Architectural support for JavaScript type-checking on mobile processors.
9	Exploiting criticality to reduce bottlenecks in distributed uniprocessors.
8	Hardware/software-based diagnosis of load-store queues using expandable activity logs.
5	Fg-STP: Fine-Grain Single Thread Partitioning on Multicores.
5	Architectural framework for supporting operating system survivability.
3	MOPED: Orchestrating interprocess message data on CMPs.
3	Safe and efficient supervised memory systems.
0	How’s the parallel computing revolution going?

2010¶

Cited by	Paper title
386	Graphite: A distributed parallel simulator for multicores.
318	ATLAS: A scalable and high-performance scheduling algorithm for multiple memory controllers.
229	Improving read performance of Phase Change Memories via Write Cancellation and Write Pausing.
213	An optimized 3D-stacked memory architecture by exploiting excessive, high-density TSV bandwidth.
191	High performance network virtualization with SR-IOV.
143	FlexiShare: Channel sharing for an energy-efficient nanophotonic crossbar.
131	Operating system support for overlapping-ISA heterogeneous multi-core architectures.
120	Application performance modeling in a virtualized environment.
115	Scalable architectural support for trusted software.
112	A Hybrid solid-state storage architecture for the performance, energy consumption, and lifetime improvement.
103	Designing a processor from the ground up to allow voltage/reliability tradeoffs.
91	Interval simulation: Raising the level of abstraction in architectural simulation.
90	CHOP: Adaptive filter-based DRAM caching for CMP server platforms.
83	Towards scalable, energy-efficient, bus-based on-chip networks.
81	Understanding how off-chip memory bandwidth partitioning in Chip Multiprocessors affects system performance.
59	Worth their watts? - an empirical study of datacenter servers.
42	Simple virtual channel allocation for high throughput and high frequency on-chip routers.
39	ESP-NUCA: A low-cost adaptive Non-Uniform Cache Architecture.
32	A bandwidth-aware memory-subsystem resource management using non-invasive resource profilers for large CMP systems.
30	UNified Instruction/Translation/Data (UNITD) coherence: One protocol to rule them all.
29	DMA cache: Using on-chip storage to architecturally separate I/O data from CPU data for improving I/O performance.
28	Explaining cache SER anomaly using DUE AVF measurement.
24	BOLT: Energy-efficient Out-of-Order Latency-Tolerant execution.
22	Value Based BTB Indexing for indirect jump prediction.
19	IADVS: On-demand performance for interactive applications.
17	LiteTM: Reducing transactional state overhead.
16	SIF: Overcoming the limitations of SIMD devices via implicit permutation.
16	StimulusCache: Boosting performance of chip multiprocessors with excess cache.
16	Delay-Hiding energy management mechanisms for DRAM.
13	Exascale computing: The challenges and opportunities in the next decade.
10	COMIC++: A software SVM system for heterogeneous multicore accelerator clusters.
8	Handling branches in TLS systems with Multi-Path Execution.
6	High-Performance low-vcc in-order core.
5	HARE: Hardware assisted reverse execution.
5	DMA++: on the fly data realignment for on-chip memories.
4	LeadOut: Composing low-overhead frequency-enhancing techniques for single-thread performance in configurable multicores.
1	Architecting for power management: The IBM POWER7™ approach.
0	Is hardware innovation over?
0	Extreme scale computing: Challenges and opportunities.

2009¶

Cited by	Paper title
348	A novel architecture of the 3D stacked MRAM L2 cache for CMPs.
183	Design and evaluation of a hierarchical on-chip interconnect for next-generation CMPs.
170	Express Cube Topologies for on-Chip Interconnects.
123	Adaptive Spill-Receive for robust high-performance caching in CMPs.
119	Variation-aware dynamic voltage/frequency scaling.
106	Elastic-buffer flow control for on-chip networks.
95	Eliminating microarchitectural dependency from Architectural Vulnerability.
95	Dynamic hardware-assisted software-controlled page placement to manage capacity allocation and sharing within large caches.
94	Prediction router: Yet another low latency on-chip router architecture.
89	PageNUCA: Selected policies for page-grain locality management in large shared chip-multiprocessor caches.
89	Accurate microarchitecture-level fault modeling for studying hardware faults.
88	Techniques for bandwidth-efficient prefetching of linked data structures in hybrid prefetching systems.
86	A low-radix and low-diameter 3D interconnection network design.
86	CAMP: A technique to estimate per-structure power at run-time using a few simple parameters.
84	Blueshift: Designing processors for timing speculation from the ground up.
76	A first-order fine-grained multithreaded throughput model.
73	Bridging the computation gap between programmable processors and hardwired accelerators.
66	Voltage emergency prediction: Using signatures to reduce operating margins.
65	Optimizing communication and capacity in a 3D stacked reconfigurable cache hierarchy.
65	In-Network Snoop Ordering (INSO): Snoopy coherence on unordered interconnects.
63	Hardware-software integrated approaches to defend against software cache-based side channel attacks.
56	MRR: Enabling fully adaptive multicast routing for CMP interconnection networks.
56	Versatile prediction and fast estimation of Architectural Vulnerability Factor from processor performance metrics.
52	iCFP: Tolerating all-level cache misses in in-order processors.
49	Design and implementation of software-managed caches for multicores with local memory.
47	Dacota: Post-silicon validation of the memory subsystem in multi-core designs.
36	An intelligent IT infrastructure for the future.
34	Reconciling specialization and flexibility through compound circuits.
34	Fast complete memory consistency verification.
33	Characterization of Direct Cache Access on multi-core systems and 10GbE.
28	Practical off-chip meta-data for temporal memory streaming.
23	Architectural Contesting.
19	Feedback mechanisms for improving probabilistic memory prefetching.
19	Soft error vulnerability aware process variation mitigation.
16	Criticality-based optimizations for efficient load processing.
3	Lightweight predication support for out of order processors.
2	Opportunities beyond single-core microprocessors.
0	Industrial perspectives panel.
0	Multi-core demands multi-interfaces.

2008¶

Cited by	Paper title
1235	Amdahl’s Law in the multicore era.
617	System level analysis of fast, per-core DVFS using on-chip switching regulators.
336	Gaining insights into multicore cache partitioning: Bridging the gap between simulation and real systems.
315	Regional congestion awareness for load balance in networks-on-chip.
231	CMP network-on-chip overlaid with multi-band RF-interconnect.
211	Cluster-level feedback power control for performance optimization.
138	FlexiTaint: A programmable accelerator for dynamic taint propagation.
132	A comprehensive approach to DRAM power management.
107	C-Oracle: Predictive thermal management for data centers.
103	Uncovering hidden loop level parallelism in sequential applications.
84	Performance and power optimization through data compression in Network-on-Chip architectures.
74	DeCoR: A Delayed Commit and Rollback mechanism for handling inductive noise in processors.
65	An OS-based alternative to full hardware coherence on tiled CMPs.
64	Automated microprocessor stressmark generation.
49	Design and implementation of the Blue Gene/P snoop filter.
47	Thread-safe dynamic binary translation using transactional memory.
42	EXCES: External caching in energy saving storage systems.
38	Runtime validation of memory ordering using constraint graph checking.
34	Supporting highly-decoupled thread-level redundancy for parallel programs.
31	Fundamental performance constraints in horizontal fusion of in-order cores.
27	Single-level integrity and confidentiality protection for distributed shared memory multiprocessors.
26	High-throughput pairwise point interactions in Anton, a specialized machine for molecular dynamics simulation.
26	Power-Efficient DRAM Speculation.
24	Address-branch correlation: A novel locality for long-latency hard-to-predict branches.
23	Incorporating flexibility in Anton, a specialized machine for molecular dynamics simulation.
21	Runahead Threads to improve SMT performance.
20	Roughness of microarchitectural design topologies and its implications for optimization.
18	Prediction of CPU idle-busy activity pattern.
14	PEEP: Exploiting predictability of memory dependences in SMT processors.
12	PaCo: Probability-based path confidence prediction.
10	Serializing instructions in system-intensive workloads: Amdahl’s Law strikes again.
10	Speculative instruction validation for performance-reliability trade-off.
8	Performance-aware speculation control using wrong path usefulness prediction.
5	Fabric convergence implications on systems architecture.
4	Branch-mispredict level parallelism (BLP) for control independence.
0	Intel’s Tera-scale Computing Project: The first five years, the next five years.
0	Compilers and parallel computing systems.

2007¶

Cited by	Paper title
1022	Evaluating MapReduce for Multi-core and Multiprocessor Systems.
373	LogTM-SE: Decoupling Hardware Transactional Memory from Caches.
243	Feedback Directed Prefetching: Improving the Performance and Bandwidth-Efficiency of Hardware Prefetchers.
190	Concurrent Direct Network Access for Virtual Machine Monitors.
159	Thermal Herding: Microarchitecture Techniques for Controlling Hotspots in High-Performance 3D-Integrated Processors.
147	A Scalable, Non-blocking Approach to Transactional Memory.
144	Application-Level Correctness and its Impact on Fault Tolerance.
141	An Adaptive Shared/Private NUCA Cache Partitioning Scheme for Chip Multiprocessors.
137	HARD: Hardware-Assisted Lockset-based Race Detection.
125	A Burst Scheduling Access Reordering Mechanism.
113	Extending Multicore Architectures to Exploit Hybrid Parallelism in Single-thread Applications.
105	Perturbation-based Fault Screening.
102	MemTracker: Efficient and Programmable Support for Memory Access Monitoring and Debugging.
97	Illustrative Design Space Studies with Microarchitectural Regression Models.
79	Fully-Buffered DIMM Memory Architectures: Understanding Mechanisms, Overheads and Scaling.
65	A Memory-Level Parallelism Aware Fetch Policy for SMT Processors.
64	Interactions Between Compression and Prefetching in Chip Multiprocessors.
64	Line Distillation: Increasing Cache Capacity by Filtering Unused Words in Cache Lines.
63	An Adaptive Cache Coherence Protocol Optimized for Producer-Consumer Sharing.
57	Modeling and Managing Thermal Profiles of Rack-mounted Servers with ThermoStat.
41	Colorama: Architectural Support for Data-Centric Synchronization.
40	Accelerating and Adapting Precomputation Threads for Efficient Prefetching.
39	Liquid SIMD: Abstracting SIMD Hardware using Lightweight Dynamic Mapping.
38	A Domain-Specific On-Chip Network Design for Large Scale Cache Systems.
36	Error Detection via Online Checking of Cache Coherence with Token Coherence Signatures.
35	A Low Overhead Fault Tolerant Coherence Protocol for CMP Architectures.
25	Optical Interconnect Opportunities for Future Server Memory Systems.
24	Exploiting Postdominance for Speculative Parallelization.
21	Improving Branch Prediction and Predicated Execution in Out-of-Order Processors.
14	Implications of Device Timing Variability on Full Chip Timing.
4	Interconnect-Centric Computing.
3	Petascale Computing Research Challenges - A Manycore Perspective.

2006¶

Cited by	Paper title
770	LogTM: log-based transactional memory.
266	Dynamic power-performance adaptation of parallel computation on chip multiprocessors.
176	BulletProof: a defect-tolerant CMP switch architecture.
169	CMP design space exploration subject to physical constraints.
164	Construction and use of linear regression models for processor performance analysis.
143	Last level cache (LLC) performance of data mining workloads on a CMP - a case study of parallel bioinformatics workloads.
127	Retention-aware placement in DRAM (RAPID): software methods for quasi-non-volatile DRAM.
125	The common case transactional behavior of multithreaded programs.
112	Phase characterization for power: evaluating control-flow-based and event-counter-based techniques.
91	CORD: cost-effective (and nearly overhead-free) order-recording and data race detection.
86	DMA-aware memory energy management.
86	Exploiting parallelism and structure to accelerate the simulation of chip multi-processors.
86	ReViveI/O: efficient handling of I/O in highly-available rollback-recovery servers.
77	Understanding the performance-temperature interactions in disk I/O of server workloads.
77	High performance file I/O for the Blue Gene/L supercomputer.
49	InfoShield: a security architecture for protecting information usage in memory.
45	An approach for implementing efficient superscalar CISC processors.
33	A decoupled KILO-instruction processor.
31	Reducing resource redundancy for concurrent error detection techniques in high performance microprocessors.
28	Increasing the cache efficiency by eliminating noise.
27	Efficient instruction schedulers for SMT processors.
23	Completely verifying memory consistency of test program executions.
21	Software-hardware cooperative memory disambiguation.
20	Store vectors for scalable memory dependence prediction and scheduling.
17	Probabilistic counter updates for predictor hysteresis and stratification.
12	Chip-multiprocessing and beyond.
5	Speculative synchronization and thread management for fine granularity threads.
2	Industrial Perspectives: Platform Design Challenges with Many cores.
0	Industrial Perspectives: The Next Roadblocks in SOC Evolution: On-Chip Storage Capacity and Off-Chip Bandwidth.
0	Industrial Perspectives: System IO Network Evolution - Closing Requirement Gaps.
0	New architectures for a new biology.

2005¶

Cited by	Paper title
616	Unbounded Transactional Memory.
589	Predicting Inter-Thread Cache Contention on a Chip Multi-Processor Architecture.
499	Power Efficient Processor Architecture and The Cell Processor.
395	The Soft Error Problem: An Architectural Perspective.
198	Chip Multithreading: Opportunities and Challenges.
183	Performance, Energy, and Thermal Considerations for SMT and CMP Architectures.
177	SafeMem: Exploiting ECC-Memory for Detecting Memory Leaks and Memory Corruption During Production Runs.
136	A New Scalable and Cost-Effective Congestion Management Strategy for Lossless Multistage Interconnection Networks.
117	Transition Phase Classification and Prediction.
111	Characterizing and Comparing Prevailing Simulation Techniques.
104	Improving Multiple-CMP Systems Using Token Coherence.
91	A Performance Comparison of DRAM Memory System Optimizations for SMT Processors.
90	Checkpointed Early Load Retirement.
84	A Unified Compressed Memory Hierarchy.
81	Trends in High-Performance Processors.
78	Voltage and Frequency Control With Adaptive Reaction Time in Multiple-Clock-Domain Processors.
75	Distributing the Frontend for Temperature Reduction.
67	SENSS: Security Enhancement to Symmetric Shared Memory Multiprocessors.
64	On the Limits of Leakage Power Reduction in Caches.
61	Stretching the Limits of Clock-Gating Efficiency in Server-Class Processors.
61	A Small, Fast and Low-Power Register File by Bit-Partitioning.
59	Effective Instruction Prefetching in Chip Multiprocessors for Modern Commercial Applications.
54	Enterprise IT Trends and Implications for Architecture Research.
53	Microarchitectural Wire Management for Performance and Power in Partitioned Architectures.
53	An Efficient Programmable 10 Gigabit Ethernet Network Interface Card.
47	Scatter-Add in Data Parallel Architectures.
43	Heat Stroke: Power-Density-Based Denial of Service in SMT.
34	Exploring the Design Space of Power-Aware Opto-Electronic Networked Systems.
28	Multithreaded Value Prediction.
27	Software Directed Issue Queue Power Reduction.
26	Low-Overhead Interactive Debugging via Dynamic Instrumentation with DISE.
26	Accurate Energy Dissipation and Thermal Modeling for Nanometer-Scale Buses.
11	Tapping ZettaRAM™ for Low-Power Memory Systems.
11	Using Virtual Load/Store Queues (VLSQs) to Reduce the Negative Effects of Reordered Memory Instructions.
0	The Future of Computer Architecture Research: An Industrial Perspective.