MICRO

Cited by Paper title Year
1501 McPAT: an integrated power, area, and timing modeling framework for multicore and manycore architectures. 2009
941 Utility-Based Cache Partitioning: A Low-Overhead, High-Performance, Runtime Mechanism to Partition Shared Caches. 2006
603 An Analysis of Efficient Multi-Core Global Power Management Policies: Maximizing Performance for a Given Power Budget. 2006
552 Die Stacking (3D) Microarchitecture. 2006
470 Qilin: exploiting parallelism on heterogeneous multiprocessors with adaptive mapping. 2009
465 Enhancing lifetime and security of PCM-based main memory with start-gap wear leveling. 2009
449 Optimizing NUCA Organizations and Wiring Alternatives for Large Caches with CACTI 6.0. 2007
420 Stall-Time Fair Memory Access Scheduling for Chip Multiprocessors. 2007
370 LIFT: A Low-Overhead Practical Information Flow Tracking System for Detecting Security Attacks. 2006
365 Flattened Butterfly Topology for On-Chip Networks. 2007
358 Managing Distributed, Shared L2 Caches through OS-Level Page Allocation. 2006
358 Dynamic Warp Formation and Scheduling for Efficient GPU Control Flow. 2007
335 Characterizing flash memory: anomalies, observations, and applications. 2009
334 Fair Queuing Memory Systems. 2006
334 Flip-N-Write: a simple deterministic technique to improve PRAM write performance, energy and endurance. 2009
321 Neural Acceleration for General-Purpose Approximate Programs. 2012
302 Leveraging Optical Technology in Future Bus-based Chip Multiprocessors. 2006
299 ViChaR: A Dynamic Virtual Channel Regulator for Network-on-Chip Routers. 2006
296 Into the wild: studying real user activity patterns to guide power optimizations for mobile architectures. 2009
294 Thread Cluster Memory Scheduling: Exploiting Differences in Memory Access Behavior. 2010
285 Bubble-Up: increasing utilization in modern warehouse scale computers via sensible co-locations. 2011
282 Live, Runtime Phase Monitoring and Prediction on Real Systems with Application to Dynamic Power Management. 2006
278 Automatic Thread Extraction with Decoupled Software Pipelining. 2005
253 Improving GPU performance via large warps and two-level warp scheduling. 2011
247 Multi-bit Error Tolerant Caches Using Two-Dimensional Error Coding. 2007
241 Argus: Low-Cost, Comprehensive Error Detection in Simple Cores. 2007
237 ASR: Adaptive Selective Replication for CMP Caches. 2006
211 A Dynamic Compilation Framework for Controlling Microprocessor Energy and Performance. 2005
211 Coordinated management of multiple interacting resources in chip multiprocessors: A machine learning approach. 2008
210 Single-Chip Heterogeneous Computing: Does the Future Include Custom Logic, FPGAs, and GPGPUs? 2010
210 Moneta: A High-Performance Storage Array Architecture for Next-Generation, Non-volatile Memories. 2010
209 Architectural Support for Software Transactional Memory. 2006
206 Cache-Conscious Wavefront Scheduling. 2012
203 Mini-rank: Adaptive DRAM architecture for improving memory power efficiency. 2008
197 A Practical Approach to Exploiting Coarse-Grained Pipeline Parallelism in C Programs. 2007
192 Composable Lightweight Processors. 2007
185 Facelift: Hiding and slowing down aging in multicores. 2008
184 Penelope: The NBTI-Aware Processor. 2007
180 Distributed Microarchitectural Protocols in the TRIPS Prototype Processor. 2006
179 Revisiting the Sequential Programming Model for Multi-Core. 2007
172 Smart Refresh: An Enhanced Memory Controller Design for Reducing Energy in Conventional and 3D Die-Stacked DRAMs. 2007
170 Multi retention level STT-RAM cache designs with a dynamic refresh scheme. 2011
169 DaDianNao: A Machine-Learning Supercomputer. 2014
166 Reducing memory interference in multicore systems via application-aware memory channel partitioning. 2011
165 Application-aware prioritization mechanisms for on-chip networks. 2009
159 Reunion: Complexity-Effective Multicore Redundancy. 2006
157 FPGA-Accelerated Simulation Technologies (FAST): Fast, Full-System, Cycle-Accurate Simulators. 2007
156 Stream Programming on General-Purpose Processors. 2005
153 Prefetch-Aware DRAM Controllers. 2008
148 Low-cost router microarchitecture for on-chip networks. 2009
148 SCARAB: a single cycle adaptive routing and bufferless network. 2009
147 Cache bursts: A new approach for eliminating dead blocks and increasing cache efficiency. 2008
146 The ZCache: Decoupling Ways and Associativity. 2010
146 Pack&Cap: adaptive DVFS and thread packing under power caps. 2011
143 Understanding the Energy Consumption of Dynamic Random Access Memories. 2010
139 A tagless coherence directory. 2009
137 In-Network Cache Coherence. 2006
137 Copy or Discard execution model for speculative parallelization on multicores. 2008
136 Implementing Signatures for Transactional Memory. 2007
136 Approximate storage in solid-state memories. 2013
135 SAFER: Stuck-At-Fault Error Recovery for Memories. 2010
134 Improving cache lifetime reliability at ultra-low voltages. 2009
130 A Predictive Performance Model for Superscalar Processors. 2006
129 A novel cache architecture with enhanced performance and security. 2008
129 Coordinated control of multiple prefetchers in multi-core systems. 2009
129 SHiP: signature-based hit predictor for high performance caching. 2011
128 Mitigating the Impact of Process Variations on Processor Register Files and Execution Units. 2006
127 A Framework for Providing Quality of Service in Chip Multi-Processors. 2007
125 Transactional Memory Architecture and Implementation for IBM System Z. 2012
123 Efficiently enabling conventional block sizes for very large die-stacked DRAM caches. 2011
123 Fundamental Latency Trade-off in Architecting DRAM Caches: Outperforming Impractical SRAM-Tags with a Simple and Practical Design. 2012
120 Pseudo-LIFO: the foundation of a new family of replacement policies for last-level caches. 2009
118 EazyHTM: eager-lazy hardware transactional memory. 2009
118 Characterizing and mitigating the impact of process variations on phase change based memory systems. 2009
115 A Mechanism for Online Diagnosis of Hard Faults in Microprocessors. 2005
115 Comparing cache architectures and coherency protocols on x86-64 multicore SMP systems. 2009
113 Reducing the harmful effects of last-level cache polluters with an OS-level, software-only pollute buffer. 2008
113 Token flow control. 2008
113 Preemptive virtual clock: a flexible, efficient, and cost-effective QOS scheme for networks-on-chip. 2009
112 Mitigating Parameter Variation with Dynamic Fine-Grain Body Biasing. 2007
112 QsCores: trading dark silicon for scalable energy efficiency with quasi-specific cores. 2011
111 CoScale: Coordinating CPU and Memory System DVFS in Server Systems. 2012
111 SAGE: self-tuning approximation for graphics engines. 2013
110 Yield-Aware Cache Architectures. 2006
108 Process Variation Tolerant 3T1D-Based Cache Architectures. 2007
107 Software-Based Online Detection of Hardware Defects: Mechanisms, Architectural Support, and Evaluation. 2007
107 Quality programmable vector processors for approximate computing. 2013
106 Sampling Dead Block Prediction for Last-Level Caches. 2010
105 Light speed arbitration and flow control for nanophotonic interconnects. 2009
104 Dependence-aware transactional memory for increased concurrency. 2008
104 Parallel application memory scheduling. 2011
102 Self-calibrating Online Wearout Detection. 2007
99 Leveraging 3D Technology for Improved Reliability. 2007
99 Complexity effective memory access scheduling for many-core accelerator architectures. 2009
99 Composite Cores: Pushing Heterogeneity Into a Core. 2012
97 Dynamic Helper Threaded Prefetching on the Sun UltraSPARC CMP Processor. 2005
96 From SODA to scotch: The evolution of a wireless baseband processor. 2008
96 Task Superscalar: An Out-of-Order Task Pipeline. 2010
96 Active management of timing guardband to save energy in POWER7. 2011
94 A case for dynamic frequency tuning in on-chip networks. 2009
94 Many-Thread Aware Prefetching Mechanisms for GPGPU Applications. 2010
92 Adaptive Caches: Effective Shaping of Cache Behavior to Workloads. 2006
92 Using Address Independent Seed Encryption and Bonsai Merkle Trees to Make Secure Processors OS- and Performance-Friendly. 2007
91 The TM3270 Media-Processor. 2005
91 EVAL: Utilizing processors with variation-induced timing errors. 2008
90 Memory Prefetching Using Adaptive Stream Detection. 2006
90 Virtual tree coherence: Leveraging regions and in-network multicast trees for scalable cache coherence. 2008
90 Achieving Non-Inclusive Cache Performance with Inclusive Caches: Temporal Locality Aware (TLA) Cache Management Policies. 2010
89 Polymorphic pipeline array: a flexible multicore accelerator with virtualized execution for mobile multimedia applications. 2009
88 SD3: A Scalable Approach to Dynamic Data-Dependence Profiling. 2010
87 Minimalist open-page: a DRAM page-mode scheduling policy for the many-core era. 2011
86 Bundled execution of recurring traces for energy-efficient general purpose processing. 2011
85 Efficient unicast and multicast support for CMPs. 2008
84 The BubbleWrap many-core: popping cores for sequential acceleration. 2009
83 Elastic Refresh: Techniques to Mitigate Refresh Penalties in High Density Memory. 2010
82 Throughput-Effective On-Chip Networks for Manycore Accelerators. 2010
81 Finding concurrency bugs with context-aware communication graphs. 2009
80 The StageNet fabric for constructing resilient multicore systems. 2008
80 Meet the walkers: accelerating index traversals for in-memory databases. 2013
78 mSWAT: low-cost hardware fault detection and diagnosis for multicore systems. 2009
78 PACMan: prefetch-aware cache management for high performance caching. 2011
78 Kiln: closing the performance gap between systems with and without persistence support. 2013
77 A Quantum Logic Array Microarchitecture: Scalable Quantum Data Movement and Computation. 2005
77 Low Vccmin fault-tolerant cache with highly predictable performance. 2009
77 ZerehCache: armoring cache architectures in high defect density technologies. 2009
76 Exploiting Fine-Grained Data Parallelism with Chip Multiprocessors and Fast Barriers. 2006
75 Microarchitectural Design Space Exploration Using an Architecture-Centric Approach. 2007
75 Pay-As-You-Go: low-overhead hard-error correction for phase change memories. 2011
74 Extending the effectiveness of 3D-stacked DRAM caches with an adaptive multi-queue policy. 2009
73 Adaptive line placement with the set balancing cache. 2009
73 Improving Cache Management Policies Using Dynamic Reuse Distances. 2012
72 KnightShift: Scaling the Energy Proportionality Wall through Server-Level Heterogeneity. 2012
70 NoRD: Node-Router Decoupling for Effective Power-gating of On-Chip Routers. 2012
70 Unifying Primary Cache, Scratch, and Register File Memories in a Throughput Processor. 2012
69 Scalable Store-Load Forwarding via Store Queue Index Prediction. 2005
69 Improving memory bank-level parallelism in the presence of prefetching. 2009
69 Divergence-aware warp scheduling. 2013
68 RowClone: fast and energy-efficient in-DRAM bulk data copy and initialization. 2013
67 ASF: AMD64 Extension for Lock-Free Data Structures and Transactional Memory. 2010
67 SIMD re-convergence at thread frontiers. 2011
66 NoSQ: Store-Load Communication without a Store Queue. 2006
66 Power reduction of CMP communication networks via RF-interconnects. 2008
66 Kernel Weaver: Automatically Fusing Database Primitives for Efficient GPU Computation. 2012
64 The Cell Processor Architecture. 2005
63 Tribeca: design for PVT variations with local recovery and fine-grained adaptation. 2009
63 A Dynamically Adaptable Hardware Transactional Memory. 2010
62 A Framework for Coarse-Grain Optimizations in the On-Chip Memory Hierarchy. 2007
61 Shader Performance Analysis on a Modern GPU Architecture. 2005
61 Data Access Partitioning for Fine-grain Parallelism on Multicore Architectures. 2007
61 Preventing PCM banks from seizing too much power. 2011
60 Molecular Caches: A caching structure for dynamic creation of application-specific Heterogeneous cache regions. 2006
60 CPR: Composable performance regression for scalable multiprocessor models. 2008
60 Notary: Hardware techniques to enhance signatures. 2008
60 ESKIMO: Energy savings using Semantic Knowledge of Inconsequential Memory Occupancy for DRAM subsystem. 2009
60 Hardware transactional memory for GPU architectures. 2011
60 Predicting Performance Impact of DVFS for Realistic Memory Systems. 2012
59 Coherence Ordering for Ring-based Chip Multiprocessors. 2006
59 Scavenger: A New Last Level Cache Architecture with Global Block Priority. 2007
59 Power to the people: Leveraging human physiological traits to control microprocessor frequency. 2008
59 MorphCore: An Energy-Efficient Microarchitecture for High Performance ILP and High Throughput TLP. 2012
59 Heterogeneous system coherence for integrated CPU-GPU systems. 2013
58 Fire-and-Forget: Load/Store Scheduling with No Store Queue at All. 2006
58 Towards the ideal on-chip fabric for 1-to-many and many-to-1 communication. 2011
58 Architectural support for secure virtualization under a vulnerable hypervisor. 2011
57 Pinot: Speculative Multi-threading Processor Architecture Exploiting Parallelism over a Wide Range of Granularities. 2005
57 Fairness and Throughput in Switch on Event Multithreading. 2006
56 SHARP control: controlled shared cache management in chip multiprocessors. 2009
55 Thermal Management of On-Chip Caches Through Power Density Minimization. 2005
55 Proactive transaction scheduling for contention management. 2009
54 Portable compiler optimisation across embedded programs and microarchitectures using machine learning. 2009
54 NoCAlert: An On-Line and Real-Time Fault Detection Mechanism for Network-on-Chip Architectures. 2012
54 CoLT: Coalesced Large-Reach TLBs. 2012
53 A locality-aware memory hierarchy for energy-efficient GPU architectures. 2013
52 Wish Branches: Combining Conditional Branching and Predication for Adaptive Predicated Execution. 2005
52 Improving Region Selection in Dynamic Optimization Systems. 2005
52 Emulating Optimal Replacement with a Shepherd Cache. 2007
52 Online design bug detection: RTL analysis, flexible mechanisms, and evaluation. 2008
52 ReMAP: A Reconfigurable Heterogeneous Multicore Architecture. 2010
52 Spatiotemporal Coherence Tracking. 2012
51 Reconfigurable energy efficient near threshold cache architectures. 2008
51 Adaptive Cache Management for Energy-Efficient GPU Computing. 2014
50 Scalable Cache Miss Handling for High Memory-Level Parallelism. 2006
49 Address-Indexed Memory Disambiguation and Store-to-Load Forwarding. 2005
49 Tradeoffs in designing accelerator architectures for visual computing. 2008
49 Toward a multicore architecture for real-time ray-tracing. 2008
49 Execution leases: a hardware-supported mechanism for enforcing strong non-interference. 2009
49 In-network coherence filtering: snoopy coherence without broadcasts. 2009
49 BulkCompiler: high-performance sequential consistency through cooperative compiler and hardware support. 2009
49 Combating Aging with the Colt Duty Cycle Equalizer. 2010
49 A Predictive Model for Dynamic Microarchitectural Adaptivity Control. 2010
49 Unison Cache: A Scalable and Effective Die-Stacked DRAM Cache. 2014
48 ReSlice: Selective Re-Execution of Long-Retired Misspeculated Instructions Using Forward Slicing. 2005
48 CAPSULE: Hardware-Assisted Parallel Execution of Component-Based Programs. 2006
48 Voltage Smoothing: Characterizing and Mitigating Voltage Noise in Production Processors via Software-Guided Thread Scheduling. 2010
48 A Mostly-Clean DRAM Cache for Effective Hit Speculation and Self-Balancing Dispatch. 2012
47 Dynamic Standby Prediction for Leakage Tolerant Microprocessor Functional Units. 2006
47 NBTI tolerant microarchitecture design in the presence of process variation. 2008
47 Flexible and Efficient Instruction-Grained Run-Time Monitoring Using On-Chip Reconfigurable Fabric. 2010
47 A compile-time managed multi-level register file hierarchy. 2011
47 Linearly compressed pages: a low-complexity, low-latency main memory compression framework. 2013
45 Continuous Path and Edge Profiling. 2005
45 Token tenure: PATCHing token counting using directory-based cache coherence. 2008
45 Memory Latency Reduction via Thread Throttling. 2010
45 Fractal Coherence: Scalably Verifiable Cache Coherence. 2010
45 A new case for the TAGE branch predictor. 2011
45 FPB: Fine-grained Power Budgeting to Improve Write Throughput of Multi-level Cell Phase Change Memory. 2012
45 Decoupled compressed cache: exploiting spatial locality for energy-optimized compressed caching. 2013
44 Phoenix: Detecting and Recovering from Permanent Processor Design Bugs with Programmable Hardware. 2006
44 Temporal instruction fetch streaming. 2008
44 Amoeba-Cache: Adaptive Blocks for Eliminating Waste in the Memory Hierarchy. 2012
44 SMiTe: Precise QoS Prediction on Real-System SMT Processors to Improve Utilization in Warehouse Scale Computers. 2014
43 Reducing peak power with a table-driven adaptive processor core. 2009
42 "Flea-flicker" Multipass Pipelining: An Alternative to the High-Power Out-of-Order Offense. 2005
42 Offline symbolic analysis for multi-processor execution replay. 2009
42 Efficient Selection of Vector Instructions Using Dynamic Programming. 2010
42 A resistive TCAM accelerator for data-intensive computing. 2011
42 Rethinking DRAM Power Modes for Energy Proportionality. 2012
42 FIRM: Fair and High-Performance Memory Control for Persistent Memory Systems. 2014
41 Adaptive data compression for high-performance low-power on-chip networks. 2008
41 Low-power, high-performance analog neural branch prediction. 2008
41 An hybrid eDRAM/SRAM macrocell to implement first-level data caches. 2009
40 A Criticality Analysis of Clustering in Superscalar Processors. 2005
40 Optimizing shared cache behavior of chip multiprocessors. 2009
40 Multiple clock and voltage domains for chip multi processors. 2009
40 Probabilistic Distance-Based Arbitration: Providing Equality of Service for Many-Core CMPs. 2010
40 AtomTracker: A Comprehensive Approach to Atomic Region Inference and Violation Detection. 2010
40 Dataflow execution of sequential imperative programs on multicore architectures. 2011
40 Proactive instruction fetch. 2011
40 Managing GPU Concurrency in Heterogeneous Architectures. 2014
40 Load Value Approximation. 2014
39 NOC-Out: Microarchitecting a Scale-Out Processor. 2012
38 Cherry-MP: Correctly Integrating Checkpointed Early Resource Recycling in Chip Multiprocessors. 2005
38 Uncorq: Unconstrained Snoop Request Delivery in Embedded-Ring Multiprocessors. 2007
38 Adaptive Flow Control for Robust Performance and Energy. 2010
38 Register Cache System Not for Latency Reduction Purpose. 2010
38 Linearizing irregular memory accesses for improved correlated prefetching. 2013
37 Automatic Parallelization in a Binary Rewriter. 2010
37 Parichute: Generalized Turbocode-Based Error Correction for Near-Threshold Caches. 2010
37 Large-reach memory management unit caches. 2013
37 Multi-grain coherence directories. 2013
37 Iso-X: A Flexible Architecture for Hardware-Managed Isolated Execution. 2014
37 CAMEO: A Two-Level Memory Organization with Capacity of Main Memory and Flexibility of Hardware-Managed Cache. 2014
36 Dataflow Predication. 2006
36 Support for High-Frequency Streaming in CMPs. 2006
36 Impact of Cache Coherence Protocols on the Processing of Network Traffic. 2007
36 Architecting a chunk-based memory race recorder in modern CMPs. 2009
36 The application slowdown model: quantifying and controlling the impact of inter-application interference at shared caches and main memory. 2015
35 Address-Value Delta (AVD) Prediction: Increasing the Effectiveness of Runahead Execution by Exploiting Regular Memory Allocation Patterns. 2005
34 uComplexity: Estimating Processor Design Effort. 2005
34 Store Memory-Level Parallelism Optimizations for Commercial Applications. 2005
34 Low-Cost Epoch-Based Correlation Prefetching for Commercial Applications. 2007
34 Informed Microarchitecture Design Space Exploration Using Workload Dynamics. 2007
34 Guaranteeing Hits to Improve the Efficiency of a Small Instruction Cache. 2007
34 Hybrid analytical modeling of pending cache hits, data prefetching, and MSHRs. 2008
34 Encore: low-cost, fine-grained transient fault recovery. 2011
33 DDT: design and evaluation of a dynamic program analysis for optimizing data structure usage. 2009
33 PPEP: Online Performance, Power, and Energy Prediction Framework and DVFS Space Exploration. 2014
32 Characterizing the resource-sharing levels in the UltraSPARC T2 processor. 2009
32 Synergistic TLBs for High Performance Address Translation in Chip Multiprocessors. 2010
32 Packet chaining: efficient single-cycle allocation for on-chip networks. 2011
32 Accurate Fine-Grained Processor Power Proxies. 2012
32 Warped gates: gating aware scheduling and power gating for GPGPUs. 2013
31 LOFT: A High Performance Network-on-Chip Providing Quality-of-Service Support. 2010
31 System-level integrated server architectures for scale-out datacenters. 2011
31 Profiling Data-Dependence to Assist Parallelization: Framework, Scope, and Optimization. 2012
31 A Practical Methodology for Measuring the Side-Channel Signal Available to the Attacker for Instruction-Level Events. 2014
31 Random Fill Cache Architecture. 2014
30 Cost Sensitive Modulo Scheduling in a Loop Accelerator Synthesis System. 2005
30 PathExpander: Architectural Support for Increasing the Path Coverage of Dynamic Bug Detection. 2006
30 Leveraging Heterogeneity in DRAM Main Memories to Accelerate Critical Word Access. 2012
30 Quantifying the relationship between the power delivery network and architectural policies in a 3D-stacked memory device. 2013
30 Transparent Hardware Management of Stacked DRAM as Part of Memory. 2014
30 PORPLE: An Extensible Optimizer for Portable Data Placement on GPU. 2014
29 How to Fake 1000 Registers. 2005
29 Microarchitecture soft error vulnerability characterization and mitigation under 3D integration technology. 2008
29 Scalable Speculative Parallelization on Commodity Clusters. 2010
29 Vulcan: Hardware Support for Detecting Sequential Consistency Violations Dynamically. 2012
29 Efficient Memory Virtualization: Reducing Dimensionality of Nested Page Walks. 2014
28 Authentication Control Point and Its Implications For Secure Processor Design. 2006
28 The Art of Deception: Adaptive Precision Reduction for Area Efficient Physics Acceleration. 2007
28 A small cache of large ranges: Hardware methods for efficiently searching, storing, and updating big dataflow tags. 2008
28 A performance-correctness explicitly-decoupled architecture. 2008
28 Light64: lightweight hardware support for data race detection during systematic testing of parallel programs. 2009
27 Diverge-Merge Processor (DMP): Dynamic Predicated Execution of Complex Control-Flow Graphs Based on Frequently Executed Paths. 2006
27 Tolerating Concurrency Bugs Using Transactions as Lifeguards. 2010
27 Pseudo-Circuit: Accelerating Communication for On-Chip Interconnection Networks. 2010
27 Accelerating microprocessor silicon validation by exposing ISA diversity. 2011
27 CoreRacer: a practical memory race recorder for multicore x86 TSO processors. 2011
27 Formally enhanced runtime verification to ensure NoC functional correctness. 2011
27 Residue cache: a low-energy low-area L2 cache architecture via compression and partial hits. 2011
27 Designing a Programmable Wire-Speed Regular-Expression Matching Accelerator. 2012
27 Warped-DMR: Light-weight Error Detection for GPGPU. 2012
27 Protean Code: Achieving Near-Free Online Code Transformations for Warehouse Scale Computers. 2014
26 Merging Head and Tail Duplication for Convergent Hyperblock Formation. 2006
26 Shapeshifter: Dynamically changing pipeline width and speed to address process variations. 2008
26 Control flow obfuscation with information flow tracking. 2009
26 Ordering decoupled metadata accesses in multiprocessors. 2009
26 STEM: Spatiotemporal Management of Capacity for Intra-core Last Level Caches. 2010
26 Insertion and promotion for tree-based PseudoLRU last-level caches. 2013
26 Trace based phase prediction for tightly-coupled heterogeneous cores. 2013
26 Locality-Aware Mapping of Nested Parallel Patterns on GPUs. 2014
26 CC-Hunter: Uncovering Covert Timing Channels on Shared Processor Hardware. 2014
26 Gather-scatter DRAM: in-DRAM address translation to improve the spatial locality of non-unit strided accesses. 2015
25 Exploiting Vector Parallelism in Software Pipelined Loops. 2005
25 Variation-tolerant non-uniform 3D cache management in die stacked multicore processor. 2009
25 Dynamic Reconfiguration of 3D Photonic Networks-on-Chip for Maximizing Performance and Improving Fault Tolerance. 2012
25 AUDIT: Stress Testing the Automatic Way. 2012
25 The reuse cache: downsizing the shared last-level cache. 2013
25 Bi-Modal DRAM Cache: Improving Hit Rate, Hit Latency and Bandwidth. 2014
24 Effective Optimistic-Checker Tandem Core Design through Architectural Pruning. 2007
24 AVF Stressmark: Towards an Automated Methodology for Bounding the Worst-Case Vulnerability to Soft Errors. 2010
24 Minimal Multi-threading: Finding and Removing Redundant Instructions in Multi-threaded Processors. 2010
24 FeatherWeight: low-cost optical arbitration with QoS support. 2011
24 Enabling datacenter servers to scale out economically and sustainably. 2013
24 uDIREC: unified diagnosis and reconfiguration for frugal bypass of NoC faults. 2013
24 TLC: a tag-less cache for reducing dynamic first level cache energy. 2013
23 ScalableBulk: Scalable Cache Coherence for Atomic Blocks in a Lazy Environment. 2010
22 Adaptive and Speculative Slack Simulations of CMPs on CMPs. 2010
22 Hardware Support for Relaxed Concurrency Control in Transactional Memory. 2010
22 Idempotent processor architecture. 2011
22 Identifying and predicting timing-critical instructions to boost timing speculation. 2011
21 Balancing Resource Utilization to Mitigate Power Density in Processor Pipelines. 2005
21 A microarchitecture-based framework for pre- and post-silicon power delivery analysis. 2009
21 Addressing End-to-End Memory Access Latency in NoC-Based Multicores. 2012
20 A Floorplan-Aware Dynamic Inductive Noise Controller for Reliable Processor Design. 2006
20 Global Multi-Threaded Instruction Scheduling. 2007
20 Erasing Core Boundaries for Robust and Configurable Performance. 2010
20 RDIP: return-address-stack directed instruction prefetching. 2013
20 Crank it up or dial it down: coordinated multiprocessor frequency and folding control. 2013
20 Skewed Compressed Caches. 2014
20 ThyNVM: enabling software-transparent crash consistency in persistent memory systems. 2015
19 Doppelgänger: a cache for approximate computing. 2015
19 Verification of chip multiprocessor memory systems using a relaxed scoreboard. 2008
19 Implementing high availability memory with a duplication cache. 2008
19 Evaluating the effects of cache redundancy on profit. 2008
19 Architectural Support for Fair Reader-Writer Locking. 2010
19 The NoX router. 2011
19 Vector Extensions for Decision Support DBMS Acceleration. 2012
19 Systematic Energy Characterization of CMP/SMT Processor Systems via Automated Micro-Benchmarks. 2012
19 Aegis: partitioning data block for efficient recovery of stuck-at-faults in phase change memory. 2013
19 Use it or lose it: wear-out and lifetime in future chip multiprocessors. 2013
19 Calculating Architectural Vulnerability Factors for Spatial Multi-Bit Transient Faults. 2014
19 Enabling Realistic Fine-Grain Voltage Scaling with Reconfigurable Power Distribution Networks. 2014
19 Futility Scaling: High-Associativity Cache Partitioning. 2014
19 Equalizer: Dynamic Tuning of GPU Resources for Efficient Execution. 2014
18 DMDC: Delayed Memory Dependence Checking through Age-Based Filtering. 2006
18 Virtually Pipelined Network Memory. 2006
18 Strategies for mapping dataflow blocks to distributed hardware. 2008
18 Improving SIMT Efficiency of Global Rendering Algorithms with Architectural Support for Dynamic Micro-Kernels. 2010
18 Resilient microring resonator based photonic networks. 2011
18 Accelerating Irregular Algorithms on GPGPUs Using Fine-Grain Hardware Worklists. 2014
18 Neural acceleration for GPU throughput processors. 2015
18 Jump over ASLR: Attacking branch predictors to bypass ASLR. 2016
17 Manager-client pairing: a framework for implementing coherence hierarchies. 2011
16 Serialization-Aware Mini-Graphs: Performance with Fewer Resources. 2006
16 Time Interpolation: So Many Metrics, So Few Registers. 2007
16 PipeCheck: Specifying and Verifying Microarchitectural Enforcement of Memory Consistency Models. 2014
16 Harnessing Soft Computations for Low-Budget Fault Tolerance. 2014
16 Large pages and lightweight memory management in virtualized environments: can you have it both ways? 2015
15 Testudo: Heavyweight security analysis via statistical sampling. 2008
15 SHARK: Architectural support for autonomic protection against stealth by rootkit exploits. 2008
15 A systematic methodology to develop resilient cache coherence protocols. 2011
15 A data layout optimization framework for NUCA-based multicores. 2011
15 Inferred Models for Dynamic and Sparse Hardware-Software Spaces. 2012
15 Libra: Tailoring SIMD Execution Using Heterogeneous Hardware and Dynamic Configurability. 2012
15 BuMP: Bulk Memory Access Prediction and Streaming. 2014
14 CCICheck: using µhb graphs to verify the coherence-consistency interface. 2015
14 Reducing Instruction Fetch Cost by Packing Instructions into Register Windows. 2005
14 Optimal versus Heuristic Global Code Scheduling. 2007
14 InstantCheck: Checking the Determinism of Parallel Programs Using On-the-Fly Incremental Hashing. 2010
14 Virtual Snooping: Filtering Snoops in Virtualized Multi-cores. 2010
14 SLICC: Self-Assembly of Instruction Cache Collectives for OLTP Workloads. 2012
14 SHIFT: shared history instruction fetch for lean-core server processors. 2013
14 Voltage Noise in Multi-Core Processors: Empirical Characterization and Optimization Opportunities. 2014
14 PyMTL: A Unified Framework for Vertically Integrated Computer Architecture Research. 2014
13 Using a configurable processor generator for computer architecture prototyping. 2009
13 POWER7 multi-core processor design. 2009
13 Energy efficient GPU transactional memory via space-time optimizations. 2013
13 Imbalanced cache partitioning for balanced data-parallel programs. 2013
13 Citadel: Efficiently Protecting Stacked Memory from Large Granularity Failures. 2014
13 Multi-GPU System Design with Memory Networks. 2014
13 Arbitrary Modulus Indexing. 2014
13 Efficient persist barriers for multicores. 2015
12 A register-file approach for row buffer caches in die-stacked DRAMs. 2011
12 NoC Architectures for Silicon Interposer Systems: Why Pay for more Wires when you Can Get them (from your interposer) for Free? 2014
12 Exploring the Design Space of SPMD Divergence Management on Data-Parallel Architectures. 2014
12 Architectural Specialization for Inter-Iteration Loop Dependence Patterns. 2014
12 Free launch: optimizing GPU dynamic kernel launches through thread reuse. 2015
12 Enabling interposer-based disintegration of multi-core processors. 2015
11 Using Branch Correlation to Identify Infeasible Paths for Anomaly Detection. 2006
11 Memory Protection through Dynamic Access Control. 2006
11 Complementing user-level coarse-grain parallelism with implicit speculative parallelism. 2011
11 The Performance Vulnerability of Architectural and Non-architectural Arrays to Permanent Faults. 2012
11 DESC: energy-efficient data exchange using synchronized counters. 2013
11 Efficient multiprogramming for multicores with SCAF. 2013
11 B-Fetch: Branch Prediction Directed Prefetching for Chip-Multiprocessors. 2014
11 Cross-architecture performance prediction (XAPP) using CPU code to predict GPU performance. 2015
11 Efficiently prefetching complex address patterns. 2015
10 The Future Evolution of High-Performance Microprocessors. 2005
10 Efficient Use of Invisible Registers in Thumb Code. 2005
10 Tree register allocation. 2009
10 MLP-aware dynamic instruction window resizing for adaptively exploiting both ILP and MLP. 2013
10 Micro-Sliced Virtual Processors to Hide the Effect of Discontinuous CPU Availability for Consolidated Systems. 2014
10 Avoiding information leakage in the memory controller with fixed service policies. 2015
10 A scalable architecture for ordered parallelism. 2015
10 A cloud-scale acceleration architecture. 2016
9 A distributed processor state management architecture for large-window processors. 2008
9 ATDetector: improving the accuracy of a commercial data race detector by identifying address transfer. 2011
9 Predicting Coherence Communication by Tracking Synchronization Points at Run Time. 2012
9 Efficient management of last-level caches in graphics processors for 3D scene rendering workloads. 2013
9 Exploiting GPU peak-power and performance tradeoffs through reduced effective pipeline latency. 2013
9 Hi-Rise: A High-Radix Switch for 3D Integration with Single-Cycle Arbitration. 2014
9 RpStacks: Fast and Accurate Processor Design Space Exploration Using Representative Stall-Event Stacks. 2014
9 Exploiting commutativity to reduce the cost of updates to shared data in cache-coherent systems. 2015
9 Fast support for unstructured data processing: the unified automata processor. 2015
9 Enabling coordinated register allocation and thread-level parallelism optimization for GPUs. 2015
8 Wavelength stealing: an opportunistic approach to channel sharing in multi-chip photonic interconnects. 2013
8 Dodec: Random-Link, Low-Radix On-Chip Networks. 2014
8 Using ECC Feedback to Guide Voltage Speculation in Low-Voltage Processors. 2014
8 GPU register file virtualization. 2015
8 Neuromorphic accelerators: a comparison between neuroscience and machine-learning approaches. 2015
8 Coherence domain restriction on large scale systems. 2015
8 Efficient GPU synchronization without scopes: saying no to complex consistency models. 2015
8 Rubik: fast analytical power management for latency-critical systems. 2015
8 Delegated persist ordering. 2016
7 Control-Flow Decoupling. 2012
7 A Front-End Execution Architecture for High Energy Efficiency. 2014
7 Short-Circuiting Memory Traffic in Handheld Platforms. 2014
7 Execution Drafting: Energy Efficiency through Computation Deduplication. 2014
7 Improving DRAM latency with dynamic asymmetric subarray. 2015
7 The inner most loop iteration counter: a new dimension in branch history. 2015
7 TimeTrader: exploiting latency tail to save datacenter energy for online search. 2015
7 Fork path: improving efficiency of ORAM by removing redundant memory accesses. 2015
7 IMP: indirect memory prefetcher. 2015
7 Stripes: Bit-serial deep neural network computing. 2016
6 Incremental Commit Groups for Non-Atomic Trace Processing. 2005
6 Architecture-aware automatic computation offload for native applications. 2015
6 Border control: sandboxing accelerators. 2015
6 Microarchitectural implications of event-driven server-side web applications. 2015
6 Efficient warp execution in presence of divergence with collaborative context collection. 2015
6 Characterizing, modeling, and improving the QoE of mobile devices with low battery level. 2015
5 Data-Dependency Graph Transformations for Superblock Scheduling. 2006
5 TransCom: transforming stream communication for load balance and efficiency in networks-on-chip. 2011
5 Kernel Partitioning of Streaming Applications: A Statistical Approach to an NP-complete Problem. 2012
5 Compiler Support for Optimizing Memory Bank-Level Parallelism. 2014
5 Wormhole: Wisely Predicting Multidimensional Branches. 2014
5 Loop-Aware Memory Prefetching Using Code Block Working Sets. 2014
5 The CRISP performance model for dynamic voltage and frequency scaling in a GPGPU. 2015
5 An integrated concurrency and core-ISA architectural envelope definition, and test oracle, for IBM POWER multiprocessors. 2015
5 Prediction-guided performance-energy trade-off for interactive applications. 2015
5 Continuous runahead: Transparent hardware acceleration for memory intensive workloads. 2016
5 Approxilyzer: Towards a systematic framework for instruction-level approximate computing and its application to hardware resiliency. 2016
4 Why design must change: rethinking digital design. 2009
4 GPUMech: GPU Performance Modeling Technique Based on Interval Analysis. 2014
4 Safe limits on voltage reduction efficiency in GPUs: a direct measurement approach. 2015
4 A fast and accurate analytical technique to compute the AVF of sequential bits in a processor. 2015
4 Efficiently enforcing strong memory ordering in GPUs. 2015
4 Authenticache: harnessing cache ECC for system authentication. 2015
4 Execution time prediction for energy-efficient hardware accelerators. 2015
4 Low-cost soft error resilience with unified data verification and fine-grained recovery for acoustic sensor based detection. 2016
4 Graphicionado: A high-performance and energy-efficient accelerator for graph analytics. 2016
4 Co-designing accelerators and SoC interfaces using gem5-Aladdin. 2016
4 Improving bank-level parallelism for irregular applications. 2016
4 KLAP: Kernel launch aggregation and promotion for optimizing dynamic parallelism. 2016
3 SMARQ: Software-Managed Alias Register Queue for Dynamic Optimizations. 2012
3 Virtually-aged sampling DMR: unifying circuit failure prediction and circuit failure detection. 2013
3 DeSC: decoupled supply-compute communication management for heterogeneous architectures. 2015
3 HyComp: a hybrid cache compression method for selection of data-type-specific compression methods. 2015
3 Locking down insecure indirection with hardware-based control-data isolation. 2015
3 Modeling the implications of DRAM failures and protection techniques on datacenter TCO. 2015
3 More is less: improving the energy efficiency of data movement via opportunistic use of sparse codes. 2015
3 Fused-layer CNN accelerators. 2016
3 Towards efficient server architecture for virtualized network function deployment: Implications and implementations. 2016
3 Racer: TSO consistency via race detection. 2016
2 Architectures and algorithms for millisecond-scale molecular dynamics simulations of proteins. 2008
2 CRAM: coded registers for amplified multiporting. 2011
2 Allocating rotating registers by scheduling. 2013
2 Implicit-storing and redundant-encoding-of-attribute information in error-correction-codes. 2013
2 Specializing Compiler Optimizations through Programmable Composition for Dense Matrix Computations. 2014
2 Continuous, Low Overhead, Run-Time Validation of Program Executions. 2014
2 Bias-Free Branch Predictor. 2014
2 Bungee jumps: accelerating indirect branches through HW/SW co-design. 2015
2 Adaptive guardband scheduling to improve system-level efficiency of the POWER7+. 2015
2 MORC: a manycore-oriented compressed cache. 2015
2 CLEAN-ECC: high reliability ECC for adaptive granularity memory system. 2015
2 DynaMOS: dynamic schedule migration for heterogeneous cores. 2015
2 Self-contained, accurate precomputation prefetching. 2015
2 Confluence: unified instruction supply for scale-out servers. 2015
2 Filtered runahead execution with a runahead buffer. 2015
2 SABRes: Atomic object reads for in-memory rack-scale computing. 2016
2 Cambricon-X: An accelerator for sparse neural networks. 2016
2 Efficient kernel synthesis for performance portable programming. 2016
2 Chainsaw: Von-neumann accelerators to leverage fused instruction chains. 2016
2 Bridging the I/O performance gap for big data workloads: A new NVDIMM-based approach. 2016
2 Spectral profiling: Observer-effect-free profiling by monitoring EM emanations. 2016
2 From high-level deep neural models to FPGAs. 2016
1 Microarchitecture in the system-level integration era. 2008
1 BulkCommit: scalable and fast commit of atomic blocks in a lazy multiprocessor environment. 2013
1 COMP: Compiler Optimizations for Manycore Processors. 2014
1 SAWS: synchronization aware GPGPU warp scheduling for multiple independent warp schedulers. 2015
1 vCache: architectural support for transparent and isolated virtual LLCs in virtualized environments. 2015
1 WarpPool: sharing requests with inter-warp coalescing for throughput processors. 2015
1 Enabling portable energy efficiency with memory accelerated library. 2015
1 DCS: a fast and scalable device-centric server architecture. 2015
1 Long term parking (LTP): criticality-aware resource allocation in OOO processors. 2015
1 A unified memory network architecture for in-memory computing in commodity servers. 2016
1 Path confidence based lookahead prefetching. 2016
1 Ti-states: Processor power management in the temperature inversion region. 2016
1 Cache-emulated register file: An integrated on-chip memory architecture for high performance GPGPUs. 2016
1 Chameleon: Versatile and practical near-DRAM acceleration architecture for large memory systems. 2016
1 An ultra low-power hardware accelerator for automatic speech recognition. 2016
1 HARE: Hardware accelerator for regular expressions. 2016
1 Evaluating programmable architectures for imaging and vision applications. 2016
1 Lazy release consistency for GPUs. 2016
1 Continuous shape shifting: Enabling loop co-optimization via near-free dynamic code rewriting. 2016
1 Quantifying and improving the efficiency of hardware-based mobile malware detectors. 2016
1 vDNN: Virtualized deep neural networks for scalable, memory-efficient neural network design. 2016
1 Perceptron learning for reuse prediction. 2016
1 NEUTRAMS: Neural network transformation and co-design under neuromorphic hardware constraints. 2016
1 C3D: Mitigating the NUMA bottleneck via coherent DRAM caches. 2016
1 OSCAR: Orchestrating STT-RAM cache traffic for heterogeneous CPU-GPU architectures. 2016
0 Message from the General Chairs. 2005
0 Message from the Program Co-Chairs. 2005
0 Control flow coalescing on a hybrid dataflow/von Neumann GPGPU. 2015
0 Ultra-low power render-based collision detection for CPU/GPU systems. 2015
0 Snatch: Opportunistically reassigning power allocation between processor and memory in 3D stacks. 2016
0 pTask: A smart prefetching scheme for OS intensive applications. 2016
0 MIMD synchronization on SIMT architectures. 2016
0 Redefining QoS and customizing the power management policy to satisfy individual mobile users. 2016
0 Contention-based congestion management in large-scale networks. 2016
0 PoisonIvy: Safe speculation for secure memory. 2016
0 The Bunker Cache for spatio-value approximation. 2016
0 Register sharing for equality prediction. 2016
0 CrystalBall: Statically analyzing runtime behavior via deep sequence learning. 2016
0 ReplayConfusion: Detecting cache-based covert channel attacks using record and replay. 2016
0 Dynamic error mitigation in NoCs using intelligent prediction techniques. 2016
0 Zorua: A holistic approach to resource virtualization in GPUs. 2016
0 A patch memory system for image processing and computer vision. 2016
0 Improving energy efficiency of DRAM by exploiting half page row access. 2016
0 Efficient data supply for hardware accelerators with prefetching and access/execute decoupling. 2016
0 The microarchitecture of a real-time robot motion planning accelerator. 2016
0 CANDY: Enabling coherent DRAM caches for multi-node systems. 2016
0 GRAPE: Minimizing energy for GPU applications with performance requirements. 2016
0 Exploiting semantic commutativity in hardware speculation. 2016
0 Dictionary sharing: An efficient cache compression scheme for compressed caches. 2016
0 Data-centric execution of speculative parallel programs. 2016
0 NeSC: Self-virtualizing nested storage controller. 2016
0 Reducing data movement energy via online data clustering and encoding. 2016
0 Keynotes: Internet of Things: History and hype, technology and policy. 2016
0 Concise loads and stores: The case for an asymmetric compute-memory architecture for approximation. 2016

2016

Cited by Paper title
18 Jump over ASLR: Attacking branch predictors to bypass ASLR.
10 A cloud-scale acceleration architecture.
8 Delegated persist ordering.
7 Stripes: Bit-serial deep neural network computing.
5 Continuous runahead: Transparent hardware acceleration for memory intensive workloads.
5 Approxilyzer: Towards a systematic framework for instruction-level approximate computing and its application to hardware resiliency.
4 Low-cost soft error resilience with unified data verification and fine-grained recovery for acoustic sensor based detection.
4 Graphicionado: A high-performance and energy-efficient accelerator for graph analytics.
4 Co-designing accelerators and SoC interfaces using gem5-Aladdin.
4 Improving bank-level parallelism for irregular applications.
4 KLAP: Kernel launch aggregation and promotion for optimizing dynamic parallelism.
3 Fused-layer CNN accelerators.
3 Towards efficient server architecture for virtualized network function deployment: Implications and implementations.
3 Racer: TSO consistency via race detection.
2 SABRes: Atomic object reads for in-memory rack-scale computing.
2 Cambricon-X: An accelerator for sparse neural networks.
2 Efficient kernel synthesis for performance portable programming.
2 Chainsaw: Von-neumann accelerators to leverage fused instruction chains.
2 Bridging the I/O performance gap for big data workloads: A new NVDIMM-based approach.
2 Spectral profiling: Observer-effect-free profiling by monitoring EM emanations.
2 From high-level deep neural models to FPGAs.
1 A unified memory network architecture for in-memory computing in commodity servers.
1 Path confidence based lookahead prefetching.
1 Ti-states: Processor power management in the temperature inversion region.
1 Cache-emulated register file: An integrated on-chip memory architecture for high performance GPGPUs.
1 Chameleon: Versatile and practical near-DRAM acceleration architecture for large memory systems.
1 An ultra low-power hardware accelerator for automatic speech recognition.
1 HARE: Hardware accelerator for regular expressions.
1 Evaluating programmable architectures for imaging and vision applications.
1 Lazy release consistency for GPUs.
1 Continuous shape shifting: Enabling loop co-optimization via near-free dynamic code rewriting.
1 Quantifying and improving the efficiency of hardware-based mobile malware detectors.
1 vDNN: Virtualized deep neural networks for scalable, memory-efficient neural network design.
1 Perceptron learning for reuse prediction.
1 NEUTRAMS: Neural network transformation and co-design under neuromorphic hardware constraints.
1 C3D: Mitigating the NUMA bottleneck via coherent DRAM caches.
1 OSCAR: Orchestrating STT-RAM cache traffic for heterogeneous CPU-GPU architectures.
0 Snatch: Opportunistically reassigning power allocation between processor and memory in 3D stacks.
0 pTask: A smart prefetching scheme for OS intensive applications.
0 MIMD synchronization on SIMT architectures.
0 Redefining QoS and customizing the power management policy to satisfy individual mobile users.
0 Contention-based congestion management in large-scale networks.
0 PoisonIvy: Safe speculation for secure memory.
0 The Bunker Cache for spatio-value approximation.
0 Register sharing for equality prediction.
0 CrystalBall: Statically analyzing runtime behavior via deep sequence learning.
0 ReplayConfusion: Detecting cache-based covert channel attacks using record and replay.
0 Dynamic error mitigation in NoCs using intelligent prediction techniques.
0 Zorua: A holistic approach to resource virtualization in GPUs.
0 A patch memory system for image processing and computer vision.
0 Improving energy efficiency of DRAM by exploiting half page row access.
0 Efficient data supply for hardware accelerators with prefetching and access/execute decoupling.
0 The microarchitecture of a real-time robot motion planning accelerator.
0 CANDY: Enabling coherent DRAM caches for multi-node systems.
0 GRAPE: Minimizing energy for GPU applications with performance requirements.
0 Exploiting semantic commutativity in hardware speculation.
0 Dictionary sharing: An efficient cache compression scheme for compressed caches.
0 Data-centric execution of speculative parallel programs.
0 NeSC: Self-virtualizing nested storage controller.
0 Reducing data movement energy via online data clustering and encoding.
0 Keynotes: Internet of Things: History and hype, technology and policy.
0 Concise loads and stores: The case for an asymmetric compute-memory architecture for approximation.

2015

Cited by Paper title
36 The application slowdown model: quantifying and controlling the impact of inter-application interference at shared caches and main memory.
26 Gather-scatter DRAM: in-DRAM address translation to improve the spatial locality of non-unit strided accesses.
20 ThyNVM: enabling software-transparent crash consistency in persistent memory systems.
19 Doppelgänger: a cache for approximate computing.
18 Neural acceleration for GPU throughput processors.
16 Large pages and lightweight memory management in virtualized environments: can you have it both ways?
14 CCICheck: using µhb graphs to verify the coherence-consistency interface.
13 Efficient persist barriers for multicores.
12 Free launch: optimizing GPU dynamic kernel launches through thread reuse.
12 Enabling interposer-based disintegration of multi-core processors.
11 Cross-architecture performance prediction (XAPP) using CPU code to predict GPU performance.
11 Efficiently prefetching complex address patterns.
10 Avoiding information leakage in the memory controller with fixed service policies.
10 A scalable architecture for ordered parallelism.
9 Exploiting commutativity to reduce the cost of updates to shared data in cache-coherent systems.
9 Fast support for unstructured data processing: the unified automata processor.
9 Enabling coordinated register allocation and thread-level parallelism optimization for GPUs.
8 GPU register file virtualization.
8 Neuromorphic accelerators: a comparison between neuroscience and machine-learning approaches.
8 Coherence domain restriction on large scale systems.
8 Efficient GPU synchronization without scopes: saying no to complex consistency models.
8 Rubik: fast analytical power management for latency-critical systems.
7 Improving DRAM latency with dynamic asymmetric subarray.
7 The inner most loop iteration counter: a new dimension in branch history.
7 TimeTrader: exploiting latency tail to save datacenter energy for online search.
7 Fork path: improving efficiency of ORAM by removing redundant memory accesses.
7 IMP: indirect memory prefetcher.
6 Architecture-aware automatic computation offload for native applications.
6 Border control: sandboxing accelerators.
6 Microarchitectural implications of event-driven server-side web applications.
6 Efficient warp execution in presence of divergence with collaborative context collection.
6 Characterizing, modeling, and improving the QoE of mobile devices with low battery level.
5 The CRISP performance model for dynamic voltage and frequency scaling in a GPGPU.
5 An integrated concurrency and core-ISA architectural envelope definition, and test oracle, for IBM POWER multiprocessors.
5 Prediction-guided performance-energy trade-off for interactive applications.
4 Safe limits on voltage reduction efficiency in GPUs: a direct measurement approach.
4 A fast and accurate analytical technique to compute the AVF of sequential bits in a processor.
4 Efficiently enforcing strong memory ordering in GPUs.
4 Authenticache: harnessing cache ECC for system authentication.
4 Execution time prediction for energy-efficient hardware accelerators.
3 DeSC: decoupled supply-compute communication management for heterogeneous architectures.
3 HyComp: a hybrid cache compression method for selection of data-type-specific compression methods.
3 Locking down insecure indirection with hardware-based control-data isolation.
3 Modeling the implications of DRAM failures and protection techniques on datacenter TCO.
3 More is less: improving the energy efficiency of data movement via opportunistic use of sparse codes.
2 Bungee jumps: accelerating indirect branches through HW/SW co-design.
2 Adaptive guardband scheduling to improve system-level efficiency of the POWER7+.
2 MORC: a manycore-oriented compressed cache.
2 CLEAN-ECC: high reliability ECC for adaptive granularity memory system.
2 DynaMOS: dynamic schedule migration for heterogeneous cores.
2 Self-contained, accurate precomputation prefetching.
2 Confluence: unified instruction supply for scale-out servers.
2 Filtered runahead execution with a runahead buffer.
1 SAWS: synchronization aware GPGPU warp scheduling for multiple independent warp schedulers.
1 vCache: architectural support for transparent and isolated virtual LLCs in virtualized environments.
1 WarpPool: sharing requests with inter-warp coalescing for throughput processors.
1 Enabling portable energy efficiency with memory accelerated library.
1 DCS: a fast and scalable device-centric server architecture.
1 Long term parking (LTP): criticality-aware resource allocation in OOO processors.
0 Control flow coalescing on a hybrid dataflow/von Neumann GPGPU.
0 Ultra-low power render-based collision detection for CPU/GPU systems.

2014

Cited by Paper title
169 DaDianNao: A Machine-Learning Supercomputer.
51 Adaptive Cache Management for Energy-Efficient GPU Computing.
49 Unison Cache: A Scalable and Effective Die-Stacked DRAM Cache.
44 SMiTe: Precise QoS Prediction on Real-System SMT Processors to Improve Utilization in Warehouse Scale Computers.
42 FIRM: Fair and High-Performance Memory Control for Persistent Memory Systems.
40 Managing GPU Concurrency in Heterogeneous Architectures.
40 Load Value Approximation.
37 Iso-X: A Flexible Architecture for Hardware-Managed Isolated Execution.
37 CAMEO: A Two-Level Memory Organization with Capacity of Main Memory and Flexibility of Hardware-Managed Cache.
33 PPEP: Online Performance, Power, and Energy Prediction Framework and DVFS Space Exploration.
31 A Practical Methodology for Measuring the Side-Channel Signal Available to the Attacker for Instruction-Level Events.
31 Random Fill Cache Architecture.
30 Transparent Hardware Management of Stacked DRAM as Part of Memory.
30 PORPLE: An Extensible Optimizer for Portable Data Placement on GPU.
29 Efficient Memory Virtualization: Reducing Dimensionality of Nested Page Walks.
27 Protean Code: Achieving Near-Free Online Code Transformations for Warehouse Scale Computers.
26 Locality-Aware Mapping of Nested Parallel Patterns on GPUs.
26 CC-Hunter: Uncovering Covert Timing Channels on Shared Processor Hardware.
25 Bi-Modal DRAM Cache: Improving Hit Rate, Hit Latency and Bandwidth.
20 Skewed Compressed Caches.
19 Calculating Architectural Vulnerability Factors for Spatial Multi-Bit Transient Faults.
19 Enabling Realistic Fine-Grain Voltage Scaling with Reconfigurable Power Distribution Networks.
19 Futility Scaling: High-Associativity Cache Partitioning.
19 Equalizer: Dynamic Tuning of GPU Resources for Efficient Execution.
18 Accelerating Irregular Algorithms on GPGPUs Using Fine-Grain Hardware Worklists.
16 PipeCheck: Specifying and Verifying Microarchitectural Enforcement of Memory Consistency Models.
16 Harnessing Soft Computations for Low-Budget Fault Tolerance.
15 BuMP: Bulk Memory Access Prediction and Streaming.
14 Voltage Noise in Multi-Core Processors: Empirical Characterization and Optimization Opportunities.
14 PyMTL: A Unified Framework for Vertically Integrated Computer Architecture Research.
13 Citadel: Efficiently Protecting Stacked Memory from Large Granularity Failures.
13 Multi-GPU System Design with Memory Networks.
13 Arbitrary Modulus Indexing.
12 NoC Architectures for Silicon Interposer Systems: Why Pay for more Wires when you Can Get them (from your interposer) for Free?
12 Exploring the Design Space of SPMD Divergence Management on Data-Parallel Architectures.
12 Architectural Specialization for Inter-Iteration Loop Dependence Patterns.
11 B-Fetch: Branch Prediction Directed Prefetching for Chip-Multiprocessors.
10 Micro-Sliced Virtual Processors to Hide the Effect of Discontinuous CPU Availability for Consolidated Systems.
9 Hi-Rise: A High-Radix Switch for 3D Integration with Single-Cycle Arbitration.
9 RpStacks: Fast and Accurate Processor Design Space Exploration Using Representative Stall-Event Stacks.
8 Dodec: Random-Link, Low-Radix On-Chip Networks.
8 Using ECC Feedback to Guide Voltage Speculation in Low-Voltage Processors.
7 A Front-End Execution Architecture for High Energy Efficiency.
7 Short-Circuiting Memory Traffic in Handheld Platforms.
7 Execution Drafting: Energy Efficiency through Computation Deduplication.
5 Compiler Support for Optimizing Memory Bank-Level Parallelism.
5 Wormhole: Wisely Predicting Multidimensional Branches.
5 Loop-Aware Memory Prefetching Using Code Block Working Sets.
4 GPUMech: GPU Performance Modeling Technique Based on Interval Analysis.
2 Specializing Compiler Optimizations through Programmable Composition for Dense Matrix Computations.
2 Continuous, Low Overhead, Run-Time Validation of Program Executions.
2 Bias-Free Branch Predictor.
1 COMP: Compiler Optimizations for Manycore Processors.

2013

Cited by Paper title
136 Approximate storage in solid-state memories.
111 SAGE: self-tuning approximation for graphics engines.
107 Quality programmable vector processors for approximate computing.
80 Meet the walkers: accelerating index traversals for in-memory databases.
78 Kiln: closing the performance gap between systems with and without persistence support.
69 Divergence-aware warp scheduling.
68 RowClone: fast and energy-efficient in-DRAM bulk data copy and initialization.
59 Heterogeneous system coherence for integrated CPU-GPU systems.
53 A locality-aware memory hierarchy for energy-efficient GPU architectures.
47 Linearly compressed pages: a low-complexity, low-latency main memory compression framework.
45 Decoupled compressed cache: exploiting spatial locality for energy-optimized compressed caching.
38 Linearizing irregular memory accesses for improved correlated prefetching.
37 Large-reach memory management unit caches.
37 Multi-grain coherence directories.
32 Warped gates: gating aware scheduling and power gating for GPGPUs.
30 Quantifying the relationship between the power delivery network and architectural policies in a 3D-stacked memory device.
26 Insertion and promotion for tree-based PseudoLRU last-level caches.
26 Trace based phase prediction for tightly-coupled heterogeneous cores.
25 The reuse cache: downsizing the shared last-level cache.
24 Enabling datacenter servers to scale out economically and sustainably.
24 uDIREC: unified diagnosis and reconfiguration for frugal bypass of NoC faults.
24 TLC: a tag-less cache for reducing dynamic first level cache energy.
20 RDIP: return-address-stack directed instruction prefetching.
20 Crank it up or dial it down: coordinated multiprocessor frequency and folding control.
19 Aegis: partitioning data block for efficient recovery of stuck-at-faults in phase change memory.
19 Use it or lose it: wear-out and lifetime in future chip multiprocessors.
14 SHIFT: shared history instruction fetch for lean-core server processors.
13 Energy efficient GPU transactional memory via space-time optimizations.
13 Imbalanced cache partitioning for balanced data-parallel programs.
11 DESC: energy-efficient data exchange using synchronized counters.
11 Efficient multiprogramming for multicores with SCAF.
10 MLP-aware dynamic instruction window resizing for adaptively exploiting both ILP and MLP.
9 Efficient management of last-level caches in graphics processors for 3D scene rendering workloads.
9 Exploiting GPU peak-power and performance tradeoffs through reduced effective pipeline latency.
8 Wavelength stealing: an opportunistic approach to channel sharing in multi-chip photonic interconnects.
3 Virtually-aged sampling DMR: unifying circuit failure prediction and circuit failure detection.
2 Allocating rotating registers by scheduling.
2 Implicit-storing and redundant-encoding-of-attribute information in error-correction-codes.
1 BulkCommit: scalable and fast commit of atomic blocks in a lazy multiprocessor environment.

2012

Cited by Paper title
321 Neural Acceleration for General-Purpose Approximate Programs.
206 Cache-Conscious Wavefront Scheduling.
125 Transactional Memory Architecture and Implementation for IBM System Z.
123 Fundamental Latency Trade-off in Architecting DRAM Caches: Outperforming Impractical SRAM-Tags with a Simple and Practical Design.
111 CoScale: Coordinating CPU and Memory System DVFS in Server Systems.
99 Composite Cores: Pushing Heterogeneity Into a Core.
73 Improving Cache Management Policies Using Dynamic Reuse Distances.
72 KnightShift: Scaling the Energy Proportionality Wall through Server-Level Heterogeneity.
70 NoRD: Node-Router Decoupling for Effective Power-gating of On-Chip Routers.
70 Unifying Primary Cache, Scratch, and Register File Memories in a Throughput Processor.
66 Kernel Weaver: Automatically Fusing Database Primitives for Efficient GPU Computation.
60 Predicting Performance Impact of DVFS for Realistic Memory Systems.
59 MorphCore: An Energy-Efficient Microarchitecture for High Performance ILP and High Throughput TLP.
54 NoCAlert: An On-Line and Real-Time Fault Detection Mechanism for Network-on-Chip Architectures.
54 CoLT: Coalesced Large-Reach TLBs.
52 Spatiotemporal Coherence Tracking.
48 A Mostly-Clean DRAM Cache for Effective Hit Speculation and Self-Balancing Dispatch.
45 FPB: Fine-grained Power Budgeting to Improve Write Throughput of Multi-level Cell Phase Change Memory.
44 Amoeba-Cache: Adaptive Blocks for Eliminating Waste in the Memory Hierarchy.
42 Rethinking DRAM Power Modes for Energy Proportionality.
39 NOC-Out: Microarchitecting a Scale-Out Processor.
32 Accurate Fine-Grained Processor Power Proxies.
31 Profiling Data-Dependence to Assist Parallelization: Framework, Scope, and Optimization.
30 Leveraging Heterogeneity in DRAM Main Memories to Accelerate Critical Word Access.
29 Vulcan: Hardware Support for Detecting Sequential Consistency Violations Dynamically.
27 Designing a Programmable Wire-Speed Regular-Expression Matching Accelerator.
27 Warped-DMR: Light-weight Error Detection for GPGPU.
25 Dynamic Reconfiguration of 3D Photonic Networks-on-Chip for Maximizing Performance and Improving Fault Tolerance.
25 AUDIT: Stress Testing the Automatic Way.
21 Addressing End-to-End Memory Access Latency in NoC-Based Multicores.
19 Vector Extensions for Decision Support DBMS Acceleration.
19 Systematic Energy Characterization of CMP/SMT Processor Systems via Automated Micro-Benchmarks.
15 Inferred Models for Dynamic and Sparse Hardware-Software Spaces.
15 Libra: Tailoring SIMD Execution Using Heterogeneous Hardware and Dynamic Configurability.
14 SLICC: Self-Assembly of Instruction Cache Collectives for OLTP Workloads.
11 The Performance Vulnerability of Architectural and Non-architectural Arrays to Permanent Faults.
9 Predicting Coherence Communication by Tracking Synchronization Points at Run Time.
7 Control-Flow Decoupling.
5 Kernel Partitioning of Streaming Applications: A Statistical Approach to an NP-complete Problem.
3 SMARQ: Software-Managed Alias Register Queue for Dynamic Optimizations.

2011

Cited by Paper title
285 Bubble-Up: increasing utilization in modern warehouse scale computers via sensible co-locations.
253 Improving GPU performance via large warps and two-level warp scheduling.
170 Multi retention level STT-RAM cache designs with a dynamic refresh scheme.
166 Reducing memory interference in multicore systems via application-aware memory channel partitioning.
146 Pack&Cap: adaptive DVFS and thread packing under power caps.
129 SHiP: signature-based hit predictor for high performance caching.
123 Efficiently enabling conventional block sizes for very large die-stacked DRAM caches.
112 QsCores: trading dark silicon for scalable energy efficiency with quasi-specific cores.
104 Parallel application memory scheduling.
96 Active management of timing guardband to save energy in POWER7.
87 Minimalist open-page: a DRAM page-mode scheduling policy for the many-core era.
86 Bundled execution of recurring traces for energy-efficient general purpose processing.
78 PACMan: prefetch-aware cache management for high performance caching.
75 Pay-As-You-Go: low-overhead hard-error correction for phase change memories.
67 SIMD re-convergence at thread frontiers.
61 Preventing PCM banks from seizing too much power.
60 Hardware transactional memory for GPU architectures.
58 Towards the ideal on-chip fabric for 1-to-many and many-to-1 communication.
58 Architectural support for secure virtualization under a vulnerable hypervisor.
47 A compile-time managed multi-level register file hierarchy.
45 A new case for the TAGE branch predictor.
42 A resistive TCAM accelerator for data-intensive computing.
40 Dataflow execution of sequential imperative programs on multicore architectures.
40 Proactive instruction fetch.
34 Encore: low-cost, fine-grained transient fault recovery.
32 Packet chaining: efficient single-cycle allocation for on-chip networks.
31 System-level integrated server architectures for scale-out datacenters.
27 Accelerating microprocessor silicon validation by exposing ISA diversity.
27 CoreRacer: a practical memory race recorder for multicore x86 TSO processors.
27 Formally enhanced runtime verification to ensure NoC functional correctness.
27 Residue cache: a low-energy low-area L2 cache architecture via compression and partial hits.
24 FeatherWeight: low-cost optical arbitration with QoS support.
22 Idempotent processor architecture.
22 Identifying and predicting timing-critical instructions to boost timing speculation.
19 The NoX router.
18 Resilient microring resonator based photonic networks.
17 Manager-client pairing: a framework for implementing coherence hierarchies.
15 A systematic methodology to develop resilient cache coherence protocols.
15 A data layout optimization framework for NUCA-based multicores.
12 A register-file approach for row buffer caches in die-stacked DRAMs.
11 Complementing user-level coarse-grain parallelism with implicit speculative parallelism.
9 ATDetector: improving the accuracy of a commercial data race detector by identifying address transfer.
5 TransCom: transforming stream communication for load balance and efficiency in networks-on-chip.
2 CRAM: coded registers for amplified multiporting.

2010

Cited by Paper title
294 Thread Cluster Memory Scheduling: Exploiting Differences in Memory Access Behavior.
210 Single-Chip Heterogeneous Computing: Does the Future Include Custom Logic, FPGAs, and GPGPUs?
210 Moneta: A High-Performance Storage Array Architecture for Next-Generation, Non-volatile Memories.
146 The ZCache: Decoupling Ways and Associativity.
143 Understanding the Energy Consumption of Dynamic Random Access Memories.
135 SAFER: Stuck-At-Fault Error Recovery for Memories.
106 Sampling Dead Block Prediction for Last-Level Caches.
96 Task Superscalar: An Out-of-Order Task Pipeline.
94 Many-Thread Aware Prefetching Mechanisms for GPGPU Applications.
90 Achieving Non-Inclusive Cache Performance with Inclusive Caches: Temporal Locality Aware (TLA) Cache Management Policies.
88 SD3: A Scalable Approach to Dynamic Data-Dependence Profiling.
83 Elastic Refresh: Techniques to Mitigate Refresh Penalties in High Density Memory.
82 Throughput-Effective On-Chip Networks for Manycore Accelerators.
67 ASF: AMD64 Extension for Lock-Free Data Structures and Transactional Memory.
63 A Dynamically Adaptable Hardware Transactional Memory.
52 ReMAP: A Reconfigurable Heterogeneous Multicore Architecture.
49 Combating Aging with the Colt Duty Cycle Equalizer.
49 A Predictive Model for Dynamic Microarchitectural Adaptivity Control.
48 Voltage Smoothing: Characterizing and Mitigating Voltage Noise in Production Processors via Software-Guided Thread Scheduling.
47 Flexible and Efficient Instruction-Grained Run-Time Monitoring Using On-Chip Reconfigurable Fabric.
45 Memory Latency Reduction via Thread Throttling.
45 Fractal Coherence: Scalably Verifiable Cache Coherence.
42 Efficient Selection of Vector Instructions Using Dynamic Programming.
40 Probabilistic Distance-Based Arbitration: Providing Equality of Service for Many-Core CMPs.
40 AtomTracker: A Comprehensive Approach to Atomic Region Inference and Violation Detection.
38 Adaptive Flow Control for Robust Performance and Energy.
38 Register Cache System Not for Latency Reduction Purpose.
37 Automatic Parallelization in a Binary Rewriter.
37 Parichute: Generalized Turbocode-Based Error Correction for Near-Threshold Caches.
32 Synergistic TLBs for High Performance Address Translation in Chip Multiprocessors.
31 LOFT: A High Performance Network-on-Chip Providing Quality-of-Service Support.
29 Scalable Speculative Parallelization on Commodity Clusters.
27 Tolerating Concurrency Bugs Using Transactions as Lifeguards.
27 Pseudo-Circuit: Accelerating Communication for On-Chip Interconnection Networks.
26 STEM: Spatiotemporal Management of Capacity for Intra-core Last Level Caches.
24 AVF Stressmark: Towards an Automated Methodology for Bounding the Worst-Case Vulnerability to Soft Errors.
24 Minimal Multi-threading: Finding and Removing Redundant Instructions in Multi-threaded Processors.
23 ScalableBulk: Scalable Cache Coherence for Atomic Blocks in a Lazy Environment.
22 Adaptive and Speculative Slack Simulations of CMPs on CMPs.
22 Hardware Support for Relaxed Concurrency Control in Transactional Memory.
20 Erasing Core Boundaries for Robust and Configurable Performance.
19 Architectural Support for Fair Reader-Writer Locking.
18 Improving SIMT Efficiency of Global Rendering Algorithms with Architectural Support for Dynamic Micro-Kernels.
14 InstantCheck: Checking the Determinism of Parallel Programs Using On-the-Fly Incremental Hashing.
14 Virtual Snooping: Filtering Snoops in Virtualized Multi-cores.

2009

Cited by Paper title
1501 McPAT: an integrated power, area, and timing modeling framework for multicore and manycore architectures.
470 Qilin: exploiting parallelism on heterogeneous multiprocessors with adaptive mapping.
465 Enhancing lifetime and security of PCM-based main memory with start-gap wear leveling.
335 Characterizing flash memory: anomalies, observations, and applications.
334 Flip-N-Write: a simple deterministic technique to improve PRAM write performance, energy and endurance.
296 Into the wild: studying real user activity patterns to guide power optimizations for mobile architectures.
165 Application-aware prioritization mechanisms for on-chip networks.
148 Low-cost router microarchitecture for on-chip networks.
148 SCARAB: a single cycle adaptive routing and bufferless network.
139 A tagless coherence directory.
134 Improving cache lifetime reliability at ultra-low voltages.
129 Coordinated control of multiple prefetchers in multi-core systems.
120 Pseudo-LIFO: the foundation of a new family of replacement policies for last-level caches.
118 EazyHTM: eager-lazy hardware transactional memory.
118 Characterizing and mitigating the impact of process variations on phase change based memory systems.
115 Comparing cache architectures and coherency protocols on x86-64 multicore SMP systems.
113 Preemptive virtual clock: a flexible, efficient, and cost-effective QOS scheme for networks-on-chip.
105 Light speed arbitration and flow control for nanophotonic interconnects.
99 Complexity effective memory access scheduling for many-core accelerator architectures.
94 A case for dynamic frequency tuning in on-chip networks.
89 Polymorphic pipeline array: a flexible multicore accelerator with virtualized execution for mobile multimedia applications.
84 The BubbleWrap many-core: popping cores for sequential acceleration.
81 Finding concurrency bugs with context-aware communication graphs.
78 mSWAT: low-cost hardware fault detection and diagnosis for multicore systems.
77 Low Vccmin fault-tolerant cache with highly predictable performance.
77 ZerehCache: armoring cache architectures in high defect density technologies.
74 Extending the effectiveness of 3D-stacked DRAM caches with an adaptive multi-queue policy.
73 Adaptive line placement with the set balancing cache.
69 Improving memory bank-level parallelism in the presence of prefetching.
63 Tribeca: design for PVT variations with local recovery and fine-grained adaptation.
60 ESKIMO: Energy savings using Semantic Knowledge of Inconsequential Memory Occupancy for DRAM subsystem.
56 SHARP control: controlled shared cache management in chip multiprocessors.
55 Proactive transaction scheduling for contention management.
54 Portable compiler optimisation across embedded programs and microarchitectures using machine learning.
49 Execution leases: a hardware-supported mechanism for enforcing strong non-interference.
49 In-network coherence filtering: snoopy coherence without broadcasts.
49 BulkCompiler: high-performance sequential consistency through cooperative compiler and hardware support.
43 Reducing peak power with a table-driven adaptive processor core.
42 Offline symbolic analysis for multi-processor execution replay.
41 An hybrid eDRAM/SRAM macrocell to implement first-level data caches.
40 Optimizing shared cache behavior of chip multiprocessors.
40 Multiple clock and voltage domains for chip multi processors.
36 Architecting a chunk-based memory race recorder in modern CMPs.
33 DDT: design and evaluation of a dynamic program analysis for optimizing data structure usage.
32 Characterizing the resource-sharing levels in the UltraSPARC T2 processor.
28 Light64: lightweight hardware support for data race detection during systematic testing of parallel programs.
26 Control flow obfuscation with information flow tracking.
26 Ordering decoupled metadata accesses in multiprocessors.
25 Variation-tolerant non-uniform 3D cache management in die stacked multicore processor.
21 A microarchitecture-based framework for pre- and post-silicon power delivery analysis.
13 Using a configurable processor generator for computer architecture prototyping.
13 POWER7 multi-core processor design.
10 Tree register allocation.
4 Why design must change: rethinking digital design.

2008

Cited by Paper title
211 Coordinated management of multiple interacting resources in chip multiprocessors: A machine learning approach.
203 Mini-rank: Adaptive DRAM architecture for improving memory power efficiency.
185 Facelift: Hiding and slowing down aging in multicores.
153 Prefetch-Aware DRAM Controllers.
147 Cache bursts: A new approach for eliminating dead blocks and increasing cache efficiency.
137 Copy or Discard execution model for speculative parallelization on multicores.
129 A novel cache architecture with enhanced performance and security.
113 Reducing the harmful effects of last-level cache polluters with an OS-level, software-only pollute buffer.
113 Token flow control.
104 Dependence-aware transactional memory for increased concurrency.
96 From SODA to scotch: The evolution of a wireless baseband processor.
91 EVAL: Utilizing processors with variation-induced timing errors.
90 Virtual tree coherence: Leveraging regions and in-network multicast trees for scalable cache coherence.
85 Efficient unicast and multicast support for CMPs.
80 The StageNet fabric for constructing resilient multicore systems.
66 Power reduction of CMP communication networks via RF-interconnects.
60 CPR: Composable performance regression for scalable multiprocessor models.
60 Notary: Hardware techniques to enhance signatures.
59 Power to the people: Leveraging human physiological traits to control microprocessor frequency.
52 Online design bug detection: RTL analysis, flexible mechanisms, and evaluation.
51 Reconfigurable energy efficient near threshold cache architectures.
49 Tradeoffs in designing accelerator architectures for visual computing.
49 Toward a multicore architecture for real-time ray-tracing.
47 NBTI tolerant microarchitecture design in the presence of process variation.
45 Token tenure: PATCHing token counting using directory-based cache coherence.
44 Temporal instruction fetch streaming.
41 Adaptive data compression for high-performance low-power on-chip networks.
41 Low-power, high-performance analog neural branch prediction.
34 Hybrid analytical modeling of pending cache hits, data prefetching, and MSHRs.
29 Microarchitecture soft error vulnerability characterization and mitigation under 3D integration technology.
28 A small cache of large ranges: Hardware methods for efficiently searching, storing, and updating big dataflow tags.
28 A performance-correctness explicitly-decoupled architecture.
26 Shapeshifter: Dynamically changing pipeline width and speed to address process variations.
19 Verification of chip multiprocessor memory systems using a relaxed scoreboard.
19 Implementing high availability memory with a duplication cache.
19 Evaluating the effects of cache redundancy on profit.
18 Strategies for mapping dataflow blocks to distributed hardware.
15 Testudo: Heavyweight security analysis via statistical sampling.
15 SHARK: Architectural support for autonomic protection against stealth by rootkit exploits.
9 A distributed processor state management architecture for large-window processors.
2 Architectures and algorithms for millisecond-scale molecular dynamics simulations of proteins.
1 Microarchitecture in the system-level integration era.

2007

Cited by Paper title
449 Optimizing NUCA Organizations and Wiring Alternatives for Large Caches with CACTI 6.0.
420 Stall-Time Fair Memory Access Scheduling for Chip Multiprocessors.
365 Flattened Butterfly Topology for On-Chip Networks.
358 Dynamic Warp Formation and Scheduling for Efficient GPU Control Flow.
247 Multi-bit Error Tolerant Caches Using Two-Dimensional Error Coding.
241 Argus: Low-Cost, Comprehensive Error Detection in Simple Cores.
197 A Practical Approach to Exploiting Coarse-Grained Pipeline Parallelism in C Programs.
192 Composable Lightweight Processors.
184 Penelope: The NBTI-Aware Processor.
179 Revisiting the Sequential Programming Model for Multi-Core.
172 Smart Refresh: An Enhanced Memory Controller Design for Reducing Energy in Conventional and 3D Die-Stacked DRAMs.
157 FPGA-Accelerated Simulation Technologies (FAST): Fast, Full-System, Cycle-Accurate Simulators.
136 Implementing Signatures for Transactional Memory.
127 A Framework for Providing Quality of Service in Chip Multi-Processors.
112 Mitigating Parameter Variation with Dynamic Fine-Grain Body Biasing.
108 Process Variation Tolerant 3T1D-Based Cache Architectures.
107 Software-Based Online Detection of Hardware Defects: Mechanisms, Architectural Support, and Evaluation.
102 Self-calibrating Online Wearout Detection.
99 Leveraging 3D Technology for Improved Reliability.
92 Using Address Independent Seed Encryption and Bonsai Merkle Trees to Make Secure Processors OS- and Performance-Friendly.
75 Microarchitectural Design Space Exploration Using an Architecture-Centric Approach.
62 A Framework for Coarse-Grain Optimizations in the On-Chip Memory Hierarchy.
61 Data Access Partitioning for Fine-grain Parallelism on Multicore Architectures.
59 Scavenger: A New Last Level Cache Architecture with Global Block Priority.
52 Emulating Optimal Replacement with a Shepherd Cache.
38 Uncorq: Unconstrained Snoop Request Delivery in Embedded-Ring Multiprocessors.
36 Impact of Cache Coherence Protocols on the Processing of Network Traffic.
34 Low-Cost Epoch-Based Correlation Prefetching for Commercial Applications.
34 Informed Microarchitecture Design Space Exploration Using Workload Dynamics.
34 Guaranteeing Hits to Improve the Efficiency of a Small Instruction Cache.
28 The Art of Deception: Adaptive Precision Reduction for Area Efficient Physics Acceleration.
24 Effective Optimistic-Checker Tandem Core Design through Architectural Pruning.
20 Global Multi-Threaded Instruction Scheduling.
16 Time Interpolation: So Many Metrics, So Few Registers.
14 Optimal versus Heuristic Global Code Scheduling.

2006

Cited by Paper title
941 Utility-Based Cache Partitioning: A Low-Overhead, High-Performance, Runtime Mechanism to Partition Shared Caches.
603 An Analysis of Efficient Multi-Core Global Power Management Policies: Maximizing Performance for a Given Power Budget.
552 Die Stacking (3D) Microarchitecture.
370 LIFT: A Low-Overhead Practical Information Flow Tracking System for Detecting Security Attacks.
358 Managing Distributed, Shared L2 Caches through OS-Level Page Allocation.
334 Fair Queuing Memory Systems.
302 Leveraging Optical Technology in Future Bus-based Chip Multiprocessors.
299 ViChaR: A Dynamic Virtual Channel Regulator for Network-on-Chip Routers.
282 Live, Runtime Phase Monitoring and Prediction on Real Systems with Application to Dynamic Power Management.
237 ASR: Adaptive Selective Replication for CMP Caches.
209 Architectural Support for Software Transactional Memory.
180 Distributed Microarchitectural Protocols in the TRIPS Prototype Processor.
159 Reunion: Complexity-Effective Multicore Redundancy.
137 In-Network Cache Coherence.
130 A Predictive Performance Model for Superscalar Processors.
128 Mitigating the Impact of Process Variations on Processor Register Files and Execution Units.
110 Yield-Aware Cache Architectures.
92 Adaptive Caches: Effective Shaping of Cache Behavior to Workloads.
90 Memory Prefetching Using Adaptive Stream Detection.
76 Exploiting Fine-Grained Data Parallelism with Chip Multiprocessors and Fast Barriers.
66 NoSQ: Store-Load Communication without a Store Queue.
60 Molecular Caches: A caching structure for dynamic creation of application-specific Heterogeneous cache regions.
59 Coherence Ordering for Ring-based Chip Multiprocessors.
58 Fire-and-Forget: Load/Store Scheduling with No Store Queue at All.
57 Fairness and Throughput in Switch on Event Multithreading.
50 Scalable Cache Miss Handling for High Memory-Level Parallelism.
48 CAPSULE: Hardware-Assisted Parallel Execution of Component-Based Programs.
47 Dynamic Standby Prediction for Leakage Tolerant Microprocessor Functional Units.
44 Phoenix: Detecting and Recovering from Permanent Processor Design Bugs with Programmable Hardware.
36 Dataflow Predication.
36 Support for High-Frequency Streaming in CMPs.
30 PathExpander: Architectural Support for Increasing the Path Coverage of Dynamic Bug Detection.
28 Authentication Control Point and Its Implications For Secure Processor Design.
27 Diverge-Merge Processor (DMP): Dynamic Predicated Execution of Complex Control-Flow Graphs Based on Frequently Executed Paths.
26 Merging Head and Tail Duplication for Convergent Hyperblock Formation.
20 A Floorplan-Aware Dynamic Inductive Noise Controller for Reliable Processor Design.
18 DMDC: Delayed Memory Dependence Checking through Age-Based Filtering.
18 Virtually Pipelined Network Memory.
16 Serialization-Aware Mini-Graphs: Performance with Fewer Resources.
11 Using Branch Correlation to Identify Infeasible Paths for Anomaly Detection.
11 Memory Protection through Dynamic Access Control.
5 Data-Dependency Graph Transformations for Superblock Scheduling.

2005

Cited by Paper title
278 Automatic Thread Extraction with Decoupled Software Pipelining.
211 A Dynamic Compilation Framework for Controlling Microprocessor Energy and Performance.
156 Stream Programming on General-Purpose Processors.
115 A Mechanism for Online Diagnosis of Hard Faults in Microprocessors.
97 Dynamic Helper Threaded Prefetching on the Sun UltraSPARC CMP Processor.
91 The TM3270 Media-Processor.
77 A Quantum Logic Array Microarchitecture: Scalable Quantum Data Movement and Computation.
69 Scalable Store-Load Forwarding via Store Queue Index Prediction.
64 The Cell Processor Architecture.
61 Shader Performance Analysis on a Modern GPU Architecture.
57 Pinot: Speculative Multi-threading Processor Architecture Exploiting Parallelism over a Wide Range of Granularities.
55 Thermal Management of On-Chip Caches Through Power Density Minimization.
52 Wish Branches: Combining Conditional Branching and Predication for Adaptive Predicated Execution.
52 Improving Region Selection in Dynamic Optimization Systems.
49 Address-Indexed Memory Disambiguation and Store-to-Load Forwarding.
48 ReSlice: Selective Re-Execution of Long-Retired Misspeculated Instructions Using Forward Slicing.
45 Continuous Path and Edge Profiling.
42 "Flea-flicker" Multipass Pipelining: An Alternative to the High-Power Out-of-Order Offense.
40 A Criticality Analysis of Clustering in Superscalar Processors.
38 Cherry-MP: Correctly Integrating Checkpointed Early Resource Recycling in Chip Multiprocessors.
35 Address-Value Delta (AVD) Prediction: Increasing the Effectiveness of Runahead Execution by Exploiting Regular Memory Allocation Patterns.
34 uComplexity: Estimating Processor Design Effort.
34 Store Memory-Level Parallelism Optimizations for Commercial Applications.
30 Cost Sensitive Modulo Scheduling in a Loop Accelerator Synthesis System.
29 How to Fake 1000 Registers.
25 Exploiting Vector Parallelism in Software Pipelined Loops.
21 Balancing Resource Utilization to Mitigate Power Density in Processor Pipelines.
14 Reducing Instruction Fetch Cost by Packing Instructions into Register Windows.
10 The Future Evolution of High-Performance Microprocessors.
10 Efficient Use of Invisible Registers in Thumb Code.
6 Incremental Commit Groups for Non-Atomic Trace Processing.
0 Message from the General Chairs.
0 Message from the Program Co-Chairs.