MICRO
All
Cited by | Paper title | Year |
---|---|---|
1501 | McPAT: an integrated power, area, and timing modeling framework for multicore and manycore architectures. | 2009 |
941 | Utility-Based Cache Partitioning: A Low-Overhead, High-Performance, Runtime Mechanism to Partition Shared Caches. | 2006 |
603 | An Analysis of Efficient Multi-Core Global Power Management Policies: Maximizing Performance for a Given Power Budget. | 2006 |
552 | Die Stacking (3D) Microarchitecture. | 2006 |
470 | Qilin: exploiting parallelism on heterogeneous multiprocessors with adaptive mapping. | 2009 |
465 | Enhancing lifetime and security of PCM-based main memory with start-gap wear leveling. | 2009 |
449 | Optimizing NUCA Organizations and Wiring Alternatives for Large Caches with CACTI 6.0. | 2007 |
420 | Stall-Time Fair Memory Access Scheduling for Chip Multiprocessors. | 2007 |
370 | LIFT: A Low-Overhead Practical Information Flow Tracking System for Detecting Security Attacks. | 2006 |
365 | Flattened Butterfly Topology for On-Chip Networks. | 2007 |
358 | Managing Distributed, Shared L2 Caches through OS-Level Page Allocation. | 2006 |
358 | Dynamic Warp Formation and Scheduling for Efficient GPU Control Flow. | 2007 |
335 | Characterizing flash memory: anomalies, observations, and applications. | 2009 |
334 | Fair Queuing Memory Systems. | 2006 |
334 | Flip-N-Write: a simple deterministic technique to improve PRAM write performance, energy and endurance. | 2009 |
321 | Neural Acceleration for General-Purpose Approximate Programs. | 2012 |
302 | Leveraging Optical Technology in Future Bus-based Chip Multiprocessors. | 2006 |
299 | ViChaR: A Dynamic Virtual Channel Regulator for Network-on-Chip Routers. | 2006 |
296 | Into the wild: studying real user activity patterns to guide power optimizations for mobile architectures. | 2009 |
294 | Thread Cluster Memory Scheduling: Exploiting Differences in Memory Access Behavior. | 2010 |
285 | Bubble-Up: increasing utilization in modern warehouse scale computers via sensible co-locations. | 2011 |
282 | Live, Runtime Phase Monitoring and Prediction on Real Systems with Application to Dynamic Power Management. | 2006 |
278 | Automatic Thread Extraction with Decoupled Software Pipelining. | 2005 |
253 | Improving GPU performance via large warps and two-level warp scheduling. | 2011 |
247 | Multi-bit Error Tolerant Caches Using Two-Dimensional Error Coding. | 2007 |
241 | Argus: Low-Cost, Comprehensive Error Detection in Simple Cores. | 2007 |
237 | ASR: Adaptive Selective Replication for CMP Caches. | 2006 |
211 | A Dynamic Compilation Framework for Controlling Microprocessor Energy and Performance. | 2005 |
211 | Coordinated management of multiple interacting resources in chip multiprocessors: A machine learning approach. | 2008 |
210 | Single-Chip Heterogeneous Computing: Does the Future Include Custom Logic, FPGAs, and GPGPUs? | 2010 |
210 | Moneta: A High-Performance Storage Array Architecture for Next-Generation, Non-volatile Memories. | 2010 |
209 | Architectural Support for Software Transactional Memory. | 2006 |
206 | Cache-Conscious Wavefront Scheduling. | 2012 |
203 | Mini-rank: Adaptive DRAM architecture for improving memory power efficiency. | 2008 |
197 | A Practical Approach to Exploiting Coarse-Grained Pipeline Parallelism in C Programs. | 2007 |
192 | Composable Lightweight Processors. | 2007 |
185 | Facelift: Hiding and slowing down aging in multicores. | 2008 |
184 | Penelope: The NBTI-Aware Processor. | 2007 |
180 | Distributed Microarchitectural Protocols in the TRIPS Prototype Processor. | 2006 |
179 | Revisiting the Sequential Programming Model for Multi-Core. | 2007 |
172 | Smart Refresh: An Enhanced Memory Controller Design for Reducing Energy in Conventional and 3D Die-Stacked DRAMs. | 2007 |
170 | Multi retention level STT-RAM cache designs with a dynamic refresh scheme. | 2011 |
169 | DaDianNao: A Machine-Learning Supercomputer. | 2014 |
166 | Reducing memory interference in multicore systems via application-aware memory channel partitioning. | 2011 |
165 | Application-aware prioritization mechanisms for on-chip networks. | 2009 |
159 | Reunion: Complexity-Effective Multicore Redundancy. | 2006 |
157 | FPGA-Accelerated Simulation Technologies (FAST): Fast, Full-System, Cycle-Accurate Simulators. | 2007 |
156 | Stream Programming on General-Purpose Processors. | 2005 |
153 | Prefetch-Aware DRAM Controllers. | 2008 |
148 | Low-cost router microarchitecture for on-chip networks. | 2009 |
148 | SCARAB: a single cycle adaptive routing and bufferless network. | 2009 |
147 | Cache bursts: A new approach for eliminating dead blocks and increasing cache efficiency. | 2008 |
146 | The ZCache: Decoupling Ways and Associativity. | 2010 |
146 | Pack&Cap: adaptive DVFS and thread packing under power caps. | 2011 |
143 | Understanding the Energy Consumption of Dynamic Random Access Memories. | 2010 |
139 | A tagless coherence directory. | 2009 |
137 | In-Network Cache Coherence. | 2006 |
137 | Copy or Discard execution model for speculative parallelization on multicores. | 2008 |
136 | Implementing Signatures for Transactional Memory. | 2007 |
136 | Approximate storage in solid-state memories. | 2013 |
135 | SAFER: Stuck-At-Fault Error Recovery for Memories. | 2010 |
134 | Improving cache lifetime reliability at ultra-low voltages. | 2009 |
130 | A Predictive Performance Model for Superscalar Processors. | 2006 |
129 | A novel cache architecture with enhanced performance and security. | 2008 |
129 | Coordinated control of multiple prefetchers in multi-core systems. | 2009 |
129 | SHiP: signature-based hit predictor for high performance caching. | 2011 |
128 | Mitigating the Impact of Process Variations on Processor Register Files and Execution Units. | 2006 |
127 | A Framework for Providing Quality of Service in Chip Multi-Processors. | 2007 |
125 | Transactional Memory Architecture and Implementation for IBM System Z. | 2012 |
123 | Efficiently enabling conventional block sizes for very large die-stacked DRAM caches. | 2011 |
123 | Fundamental Latency Trade-off in Architecting DRAM Caches: Outperforming Impractical SRAM-Tags with a Simple and Practical Design. | 2012 |
120 | Pseudo-LIFO: the foundation of a new family of replacement policies for last-level caches. | 2009 |
118 | EazyHTM: eager-lazy hardware transactional memory. | 2009 |
118 | Characterizing and mitigating the impact of process variations on phase change based memory systems. | 2009 |
115 | A Mechanism for Online Diagnosis of Hard Faults in Microprocessors. | 2005 |
115 | Comparing cache architectures and coherency protocols on x86-64 multicore SMP systems. | 2009 |
113 | Reducing the harmful effects of last-level cache polluters with an OS-level, software-only pollute buffer. | 2008 |
113 | Token flow control. | 2008 |
113 | Preemptive virtual clock: a flexible, efficient, and cost-effective QOS scheme for networks-on-chip. | 2009 |
112 | Mitigating Parameter Variation with Dynamic Fine-Grain Body Biasing. | 2007 |
112 | QsCores: trading dark silicon for scalable energy efficiency with quasi-specific cores. | 2011 |
111 | CoScale: Coordinating CPU and Memory System DVFS in Server Systems. | 2012 |
111 | SAGE: self-tuning approximation for graphics engines. | 2013 |
110 | Yield-Aware Cache Architectures. | 2006 |
108 | Process Variation Tolerant 3T1D-Based Cache Architectures. | 2007 |
107 | Software-Based Online Detection of Hardware Defects: Mechanisms, Architectural Support, and Evaluation. | 2007 |
107 | Quality programmable vector processors for approximate computing. | 2013 |
106 | Sampling Dead Block Prediction for Last-Level Caches. | 2010 |
105 | Light speed arbitration and flow control for nanophotonic interconnects. | 2009 |
104 | Dependence-aware transactional memory for increased concurrency. | 2008 |
104 | Parallel application memory scheduling. | 2011 |
102 | Self-calibrating Online Wearout Detection. | 2007 |
99 | Leveraging 3D Technology for Improved Reliability. | 2007 |
99 | Complexity effective memory access scheduling for many-core accelerator architectures. | 2009 |
99 | Composite Cores: Pushing Heterogeneity Into a Core. | 2012 |
97 | Dynamic Helper Threaded Prefetching on the Sun UltraSPARC CMP Processor. | 2005 |
96 | From SODA to scotch: The evolution of a wireless baseband processor. | 2008 |
96 | Task Superscalar: An Out-of-Order Task Pipeline. | 2010 |
96 | Active management of timing guardband to save energy in POWER7. | 2011 |
94 | A case for dynamic frequency tuning in on-chip networks. | 2009 |
94 | Many-Thread Aware Prefetching Mechanisms for GPGPU Applications. | 2010 |
92 | Adaptive Caches: Effective Shaping of Cache Behavior to Workloads. | 2006 |
92 | Using Address Independent Seed Encryption and Bonsai Merkle Trees to Make Secure Processors OS- and Performance-Friendly. | 2007 |
91 | The TM3270 Media-Processor. | 2005 |
91 | EVAL: Utilizing processors with variation-induced timing errors. | 2008 |
90 | Memory Prefetching Using Adaptive Stream Detection. | 2006 |
90 | Virtual tree coherence: Leveraging regions and in-network multicast trees for scalable cache coherence. | 2008 |
90 | Achieving Non-Inclusive Cache Performance with Inclusive Caches: Temporal Locality Aware (TLA) Cache Management Policies. | 2010 |
89 | Polymorphic pipeline array: a flexible multicore accelerator with virtualized execution for mobile multimedia applications. | 2009 |
88 | SD3: A Scalable Approach to Dynamic Data-Dependence Profiling. | 2010 |
87 | Minimalist open-page: a DRAM page-mode scheduling policy for the many-core era. | 2011 |
86 | Bundled execution of recurring traces for energy-efficient general purpose processing. | 2011 |
85 | Efficient unicast and multicast support for CMPs. | 2008 |
84 | The BubbleWrap many-core: popping cores for sequential acceleration. | 2009 |
83 | Elastic Refresh: Techniques to Mitigate Refresh Penalties in High Density Memory. | 2010 |
82 | Throughput-Effective On-Chip Networks for Manycore Accelerators. | 2010 |
81 | Finding concurrency bugs with context-aware communication graphs. | 2009 |
80 | The StageNet fabric for constructing resilient multicore systems. | 2008 |
80 | Meet the walkers: accelerating index traversals for in-memory databases. | 2013 |
78 | mSWAT: low-cost hardware fault detection and diagnosis for multicore systems. | 2009 |
78 | PACMan: prefetch-aware cache management for high performance caching. | 2011 |
78 | Kiln: closing the performance gap between systems with and without persistence support. | 2013 |
77 | A Quantum Logic Array Microarchitecture: Scalable Quantum Data Movement and Computation. | 2005 |
77 | Low Vccmin fault-tolerant cache with highly predictable performance. | 2009 |
77 | ZerehCache: armoring cache architectures in high defect density technologies. | 2009 |
76 | Exploiting Fine-Grained Data Parallelism with Chip Multiprocessors and Fast Barriers. | 2006 |
75 | Microarchitectural Design Space Exploration Using an Architecture-Centric Approach. | 2007 |
75 | Pay-As-You-Go: low-overhead hard-error correction for phase change memories. | 2011 |
74 | Extending the effectiveness of 3D-stacked DRAM caches with an adaptive multi-queue policy. | 2009 |
73 | Adaptive line placement with the set balancing cache. | 2009 |
73 | Improving Cache Management Policies Using Dynamic Reuse Distances. | 2012 |
72 | KnightShift: Scaling the Energy Proportionality Wall through Server-Level Heterogeneity. | 2012 |
70 | NoRD: Node-Router Decoupling for Effective Power-gating of On-Chip Routers. | 2012 |
70 | Unifying Primary Cache, Scratch, and Register File Memories in a Throughput Processor. | 2012 |
69 | Scalable Store-Load Forwarding via Store Queue Index Prediction. | 2005 |
69 | Improving memory bank-level parallelism in the presence of prefetching. | 2009 |
69 | Divergence-aware warp scheduling. | 2013 |
68 | RowClone: fast and energy-efficient in-DRAM bulk data copy and initialization. | 2013 |
67 | ASF: AMD64 Extension for Lock-Free Data Structures and Transactional Memory. | 2010 |
67 | SIMD re-convergence at thread frontiers. | 2011 |
66 | NoSQ: Store-Load Communication without a Store Queue. | 2006 |
66 | Power reduction of CMP communication networks via RF-interconnects. | 2008 |
66 | Kernel Weaver: Automatically Fusing Database Primitives for Efficient GPU Computation. | 2012 |
64 | The Cell Processor Architecture. | 2005 |
63 | Tribeca: design for PVT variations with local recovery and fine-grained adaptation. | 2009 |
63 | A Dynamically Adaptable Hardware Transactional Memory. | 2010 |
62 | A Framework for Coarse-Grain Optimizations in the On-Chip Memory Hierarchy. | 2007 |
61 | Shader Performance Analysis on a Modern GPU Architecture. | 2005 |
61 | Data Access Partitioning for Fine-grain Parallelism on Multicore Architectures. | 2007 |
61 | Preventing PCM banks from seizing too much power. | 2011 |
60 | Molecular Caches: A caching structure for dynamic creation of application-specific Heterogeneous cache regions. | 2006 |
60 | CPR: Composable performance regression for scalable multiprocessor models. | 2008 |
60 | Notary: Hardware techniques to enhance signatures. | 2008 |
60 | ESKIMO: Energy savings using Semantic Knowledge of Inconsequential Memory Occupancy for DRAM subsystem. | 2009 |
60 | Hardware transactional memory for GPU architectures. | 2011 |
60 | Predicting Performance Impact of DVFS for Realistic Memory Systems. | 2012 |
59 | Coherence Ordering for Ring-based Chip Multiprocessors. | 2006 |
59 | Scavenger: A New Last Level Cache Architecture with Global Block Priority. | 2007 |
59 | Power to the people: Leveraging human physiological traits to control microprocessor frequency. | 2008 |
59 | MorphCore: An Energy-Efficient Microarchitecture for High Performance ILP and High Throughput TLP. | 2012 |
59 | Heterogeneous system coherence for integrated CPU-GPU systems. | 2013 |
58 | Fire-and-Forget: Load/Store Scheduling with No Store Queue at All. | 2006 |
58 | Towards the ideal on-chip fabric for 1-to-many and many-to-1 communication. | 2011 |
58 | Architectural support for secure virtualization under a vulnerable hypervisor. | 2011 |
57 | Pinot: Speculative Multi-threading Processor Architecture Exploiting Parallelism over a Wide Range of Granularities. | 2005 |
57 | Fairness and Throughput in Switch on Event Multithreading. | 2006 |
56 | SHARP control: controlled shared cache management in chip multiprocessors. | 2009 |
55 | Thermal Management of On-Chip Caches Through Power Density Minimization. | 2005 |
55 | Proactive transaction scheduling for contention management. | 2009 |
54 | Portable compiler optimisation across embedded programs and microarchitectures using machine learning. | 2009 |
54 | NoCAlert: An On-Line and Real-Time Fault Detection Mechanism for Network-on-Chip Architectures. | 2012 |
54 | CoLT: Coalesced Large-Reach TLBs. | 2012 |
53 | A locality-aware memory hierarchy for energy-efficient GPU architectures. | 2013 |
52 | Wish Branches: Combining Conditional Branching and Predication for Adaptive Predicated Execution. | 2005 |
52 | Improving Region Selection in Dynamic Optimization Systems. | 2005 |
52 | Emulating Optimal Replacement with a Shepherd Cache. | 2007 |
52 | Online design bug detection: RTL analysis, flexible mechanisms, and evaluation. | 2008 |
52 | ReMAP: A Reconfigurable Heterogeneous Multicore Architecture. | 2010 |
52 | Spatiotemporal Coherence Tracking. | 2012 |
51 | Reconfigurable energy efficient near threshold cache architectures. | 2008 |
51 | Adaptive Cache Management for Energy-Efficient GPU Computing. | 2014 |
50 | Scalable Cache Miss Handling for High Memory-Level Parallelism. | 2006 |
49 | Address-Indexed Memory Disambiguation and Store-to-Load Forwarding. | 2005 |
49 | Tradeoffs in designing accelerator architectures for visual computing. | 2008 |
49 | Toward a multicore architecture for real-time ray-tracing. | 2008 |
49 | Execution leases: a hardware-supported mechanism for enforcing strong non-interference. | 2009 |
49 | In-network coherence filtering: snoopy coherence without broadcasts. | 2009 |
49 | BulkCompiler: high-performance sequential consistency through cooperative compiler and hardware support. | 2009 |
49 | Combating Aging with the Colt Duty Cycle Equalizer. | 2010 |
49 | A Predictive Model for Dynamic Microarchitectural Adaptivity Control. | 2010 |
49 | Unison Cache: A Scalable and Effective Die-Stacked DRAM Cache. | 2014 |
48 | ReSlice: Selective Re-Execution of Long-Retired Misspeculated Instructions Using Forward Slicing. | 2005 |
48 | CAPSULE: Hardware-Assisted Parallel Execution of Component-Based Programs. | 2006 |
48 | Voltage Smoothing: Characterizing and Mitigating Voltage Noise in Production Processors via Software-Guided Thread Scheduling. | 2010 |
48 | A Mostly-Clean DRAM Cache for Effective Hit Speculation and Self-Balancing Dispatch. | 2012 |
47 | Dynamic Standby Prediction for Leakage Tolerant Microprocessor Functional Units. | 2006 |
47 | NBTI tolerant microarchitecture design in the presence of process variation. | 2008 |
47 | Flexible and Efficient Instruction-Grained Run-Time Monitoring Using On-Chip Reconfigurable Fabric. | 2010 |
47 | A compile-time managed multi-level register file hierarchy. | 2011 |
47 | Linearly compressed pages: a low-complexity, low-latency main memory compression framework. | 2013 |
45 | Continuous Path and Edge Profiling. | 2005 |
45 | Token tenure: PATCHing token counting using directory-based cache coherence. | 2008 |
45 | Memory Latency Reduction via Thread Throttling. | 2010 |
45 | Fractal Coherence: Scalably Verifiable Cache Coherence. | 2010 |
45 | A new case for the TAGE branch predictor. | 2011 |
45 | FPB: Fine-grained Power Budgeting to Improve Write Throughput of Multi-level Cell Phase Change Memory. | 2012 |
45 | Decoupled compressed cache: exploiting spatial locality for energy-optimized compressed caching. | 2013 |
44 | Phoenix: Detecting and Recovering from Permanent Processor Design Bugs with Programmable Hardware. | 2006 |
44 | Temporal instruction fetch streaming. | 2008 |
44 | Amoeba-Cache: Adaptive Blocks for Eliminating Waste in the Memory Hierarchy. | 2012 |
44 | SMiTe: Precise QoS Prediction on Real-System SMT Processors to Improve Utilization in Warehouse Scale Computers. | 2014 |
43 | Reducing peak power with a table-driven adaptive processor core. | 2009 |
42 | "Flea-flicker" Multipass Pipelining: An Alternative to the High-Power Out-of-Order Offense. | 2005 |
42 | Offline symbolic analysis for multi-processor execution replay. | 2009 |
42 | Efficient Selection of Vector Instructions Using Dynamic Programming. | 2010 |
42 | A resistive TCAM accelerator for data-intensive computing. | 2011 |
42 | Rethinking DRAM Power Modes for Energy Proportionality. | 2012 |
42 | FIRM: Fair and High-Performance Memory Control for Persistent Memory Systems. | 2014 |
41 | Adaptive data compression for high-performance low-power on-chip networks. | 2008 |
41 | Low-power, high-performance analog neural branch prediction. | 2008 |
41 | An hybrid eDRAM/SRAM macrocell to implement first-level data caches. | 2009 |
40 | A Criticality Analysis of Clustering in Superscalar Processors. | 2005 |
40 | Optimizing shared cache behavior of chip multiprocessors. | 2009 |
40 | Multiple clock and voltage domains for chip multi processors. | 2009 |
40 | Probabilistic Distance-Based Arbitration: Providing Equality of Service for Many-Core CMPs. | 2010 |
40 | AtomTracker: A Comprehensive Approach to Atomic Region Inference and Violation Detection. | 2010 |
40 | Dataflow execution of sequential imperative programs on multicore architectures. | 2011 |
40 | Proactive instruction fetch. | 2011 |
40 | Managing GPU Concurrency in Heterogeneous Architectures. | 2014 |
40 | Load Value Approximation. | 2014 |
39 | NOC-Out: Microarchitecting a Scale-Out Processor. | 2012 |
38 | Cherry-MP: Correctly Integrating Checkpointed Early Resource Recycling in Chip Multiprocessors. | 2005 |
38 | Uncorq: Unconstrained Snoop Request Delivery in Embedded-Ring Multiprocessors. | 2007 |
38 | Adaptive Flow Control for Robust Performance and Energy. | 2010 |
38 | Register Cache System Not for Latency Reduction Purpose. | 2010 |
38 | Linearizing irregular memory accesses for improved correlated prefetching. | 2013 |
37 | Automatic Parallelization in a Binary Rewriter. | 2010 |
37 | Parichute: Generalized Turbocode-Based Error Correction for Near-Threshold Caches. | 2010 |
37 | Large-reach memory management unit caches. | 2013 |
37 | Multi-grain coherence directories. | 2013 |
37 | Iso-X: A Flexible Architecture for Hardware-Managed Isolated Execution. | 2014 |
37 | CAMEO: A Two-Level Memory Organization with Capacity of Main Memory and Flexibility of Hardware-Managed Cache. | 2014 |
36 | Dataflow Predication. | 2006 |
36 | Support for High-Frequency Streaming in CMPs. | 2006 |
36 | Impact of Cache Coherence Protocols on the Processing of Network Traffic. | 2007 |
36 | Architecting a chunk-based memory race recorder in modern CMPs. | 2009 |
36 | The application slowdown model: quantifying and controlling the impact of inter-application interference at shared caches and main memory. | 2015 |
35 | Address-Value Delta (AVD) Prediction: Increasing the Effectiveness of Runahead Execution by Exploiting Regular Memory Allocation Patterns. | 2005 |
34 | uComplexity: Estimating Processor Design Effort. | 2005 |
34 | Store Memory-Level Parallelism Optimizations for Commercial Applications. | 2005 |
34 | Low-Cost Epoch-Based Correlation Prefetching for Commercial Applications. | 2007 |
34 | Informed Microarchitecture Design Space Exploration Using Workload Dynamics. | 2007 |
34 | Guaranteeing Hits to Improve the Efficiency of a Small Instruction Cache. | 2007 |
34 | Hybrid analytical modeling of pending cache hits, data prefetching, and MSHRs. | 2008 |
34 | Encore: low-cost, fine-grained transient fault recovery. | 2011 |
33 | DDT: design and evaluation of a dynamic program analysis for optimizing data structure usage. | 2009 |
33 | PPEP: Online Performance, Power, and Energy Prediction Framework and DVFS Space Exploration. | 2014 |
32 | Characterizing the resource-sharing levels in the UltraSPARC T2 processor. | 2009 |
32 | Synergistic TLBs for High Performance Address Translation in Chip Multiprocessors. | 2010 |
32 | Packet chaining: efficient single-cycle allocation for on-chip networks. | 2011 |
32 | Accurate Fine-Grained Processor Power Proxies. | 2012 |
32 | Warped gates: gating aware scheduling and power gating for GPGPUs. | 2013 |
31 | LOFT: A High Performance Network-on-Chip Providing Quality-of-Service Support. | 2010 |
31 | System-level integrated server architectures for scale-out datacenters. | 2011 |
31 | Profiling Data-Dependence to Assist Parallelization: Framework, Scope, and Optimization. | 2012 |
31 | A Practical Methodology for Measuring the Side-Channel Signal Available to the Attacker for Instruction-Level Events. | 2014 |
31 | Random Fill Cache Architecture. | 2014 |
30 | Cost Sensitive Modulo Scheduling in a Loop Accelerator Synthesis System. | 2005 |
30 | PathExpander: Architectural Support for Increasing the Path Coverage of Dynamic Bug Detection. | 2006 |
30 | Leveraging Heterogeneity in DRAM Main Memories to Accelerate Critical Word Access. | 2012 |
30 | Quantifying the relationship between the power delivery network and architectural policies in a 3D-stacked memory device. | 2013 |
30 | Transparent Hardware Management of Stacked DRAM as Part of Memory. | 2014 |
30 | PORPLE: An Extensible Optimizer for Portable Data Placement on GPU. | 2014 |
29 | How to Fake 1000 Registers. | 2005 |
29 | Microarchitecture soft error vulnerability characterization and mitigation under 3D integration technology. | 2008 |
29 | Scalable Speculative Parallelization on Commodity Clusters. | 2010 |
29 | Vulcan: Hardware Support for Detecting Sequential Consistency Violations Dynamically. | 2012 |
29 | Efficient Memory Virtualization: Reducing Dimensionality of Nested Page Walks. | 2014 |
28 | Authentication Control Point and Its Implications For Secure Processor Design. | 2006 |
28 | The Art of Deception: Adaptive Precision Reduction for Area Efficient Physics Acceleration. | 2007 |
28 | A small cache of large ranges: Hardware methods for efficiently searching, storing, and updating big dataflow tags. | 2008 |
28 | A performance-correctness explicitly-decoupled architecture. | 2008 |
28 | Light64: lightweight hardware support for data race detection during systematic testing of parallel programs. | 2009 |
27 | Diverge-Merge Processor (DMP): Dynamic Predicated Execution of Complex Control-Flow Graphs Based on Frequently Executed Paths. | 2006 |
27 | Tolerating Concurrency Bugs Using Transactions as Lifeguards. | 2010 |
27 | Pseudo-Circuit: Accelerating Communication for On-Chip Interconnection Networks. | 2010 |
27 | Accelerating microprocessor silicon validation by exposing ISA diversity. | 2011 |
27 | CoreRacer: a practical memory race recorder for multicore x86 TSO processors. | 2011 |
27 | Formally enhanced runtime verification to ensure NoC functional correctness. | 2011 |
27 | Residue cache: a low-energy low-area L2 cache architecture via compression and partial hits. | 2011 |
27 | Designing a Programmable Wire-Speed Regular-Expression Matching Accelerator. | 2012 |
27 | Warped-DMR: Light-weight Error Detection for GPGPU. | 2012 |
27 | Protean Code: Achieving Near-Free Online Code Transformations for Warehouse Scale Computers. | 2014 |
26 | Merging Head and Tail Duplication for Convergent Hyperblock Formation. | 2006 |
26 | Shapeshifter: Dynamically changing pipeline width and speed to address process variations. | 2008 |
26 | Control flow obfuscation with information flow tracking. | 2009 |
26 | Ordering decoupled metadata accesses in multiprocessors. | 2009 |
26 | STEM: Spatiotemporal Management of Capacity for Intra-core Last Level Caches. | 2010 |
26 | Insertion and promotion for tree-based PseudoLRU last-level caches. | 2013 |
26 | Trace based phase prediction for tightly-coupled heterogeneous cores. | 2013 |
26 | Locality-Aware Mapping of Nested Parallel Patterns on GPUs. | 2014 |
26 | CC-Hunter: Uncovering Covert Timing Channels on Shared Processor Hardware. | 2014 |
26 | Gather-scatter DRAM: in-DRAM address translation to improve the spatial locality of non-unit strided accesses. | 2015 |
25 | Exploiting Vector Parallelism in Software Pipelined Loops. | 2005 |
25 | Variation-tolerant non-uniform 3D cache management in die stacked multicore processor. | 2009 |
25 | Dynamic Reconfiguration of 3D Photonic Networks-on-Chip for Maximizing Performance and Improving Fault Tolerance. | 2012 |
25 | AUDIT: Stress Testing the Automatic Way. | 2012 |
25 | The reuse cache: downsizing the shared last-level cache. | 2013 |
25 | Bi-Modal DRAM Cache: Improving Hit Rate, Hit Latency and Bandwidth. | 2014 |
24 | Effective Optimistic-Checker Tandem Core Design through Architectural Pruning. | 2007 |
24 | AVF Stressmark: Towards an Automated Methodology for Bounding the Worst-Case Vulnerability to Soft Errors. | 2010 |
24 | Minimal Multi-threading: Finding and Removing Redundant Instructions in Multi-threaded Processors. | 2010 |
24 | FeatherWeight: low-cost optical arbitration with QoS support. | 2011 |
24 | Enabling datacenter servers to scale out economically and sustainably. | 2013 |
24 | uDIREC: unified diagnosis and reconfiguration for frugal bypass of NoC faults. | 2013 |
24 | TLC: a tag-less cache for reducing dynamic first level cache energy. | 2013 |
23 | ScalableBulk: Scalable Cache Coherence for Atomic Blocks in a Lazy Environment. | 2010 |
22 | Adaptive and Speculative Slack Simulations of CMPs on CMPs. | 2010 |
22 | Hardware Support for Relaxed Concurrency Control in Transactional Memory. | 2010 |
22 | Idempotent processor architecture. | 2011 |
22 | Identifying and predicting timing-critical instructions to boost timing speculation. | 2011 |
21 | Balancing Resource Utilization to Mitigate Power Density in Processor Pipelines. | 2005 |
21 | A microarchitecture-based framework for pre- and post-silicon power delivery analysis. | 2009 |
21 | Addressing End-to-End Memory Access Latency in NoC-Based Multicores. | 2012 |
20 | A Floorplan-Aware Dynamic Inductive Noise Controller for Reliable Processor Design. | 2006 |
20 | Global Multi-Threaded Instruction Scheduling. | 2007 |
20 | Erasing Core Boundaries for Robust and Configurable Performance. | 2010 |
20 | RDIP: return-address-stack directed instruction prefetching. | 2013 |
20 | Crank it up or dial it down: coordinated multiprocessor frequency and folding control. | 2013 |
20 | Skewed Compressed Caches. | 2014 |
20 | ThyNVM: enabling software-transparent crash consistency in persistent memory systems. | 2015 |
19 | Doppelgänger: a cache for approximate computing. | 2015 |
19 | Verification of chip multiprocessor memory systems using a relaxed scoreboard. | 2008 |
19 | Implementing high availability memory with a duplication cache. | 2008 |
19 | Evaluating the effects of cache redundancy on profit. | 2008 |
19 | Architectural Support for Fair Reader-Writer Locking. | 2010 |
19 | The NoX router. | 2011 |
19 | Vector Extensions for Decision Support DBMS Acceleration. | 2012 |
19 | Systematic Energy Characterization of CMP/SMT Processor Systems via Automated Micro-Benchmarks. | 2012 |
19 | Aegis: partitioning data block for efficient recovery of stuck-at-faults in phase change memory. | 2013 |
19 | Use it or lose it: wear-out and lifetime in future chip multiprocessors. | 2013 |
19 | Calculating Architectural Vulnerability Factors for Spatial Multi-Bit Transient Faults. | 2014 |
19 | Enabling Realistic Fine-Grain Voltage Scaling with Reconfigurable Power Distribution Networks. | 2014 |
19 | Futility Scaling: High-Associativity Cache Partitioning. | 2014 |
19 | Equalizer: Dynamic Tuning of GPU Resources for Efficient Execution. | 2014 |
18 | DMDC: Delayed Memory Dependence Checking through Age-Based Filtering. | 2006 |
18 | Virtually Pipelined Network Memory. | 2006 |
18 | Strategies for mapping dataflow blocks to distributed hardware. | 2008 |
18 | Improving SIMT Efficiency of Global Rendering Algorithms with Architectural Support for Dynamic Micro-Kernels. | 2010 |
18 | Resilient microring resonator based photonic networks. | 2011 |
18 | Accelerating Irregular Algorithms on GPGPUs Using Fine-Grain Hardware Worklists. | 2014 |
18 | Neural acceleration for GPU throughput processors. | 2015 |
18 | Jump over ASLR: Attacking branch predictors to bypass ASLR. | 2016 |
17 | Manager-client pairing: a framework for implementing coherence hierarchies. | 2011 |
16 | Serialization-Aware Mini-Graphs: Performance with Fewer Resources. | 2006 |
16 | Time Interpolation: So Many Metrics, So Few Registers. | 2007 |
16 | PipeCheck: Specifying and Verifying Microarchitectural Enforcement of Memory Consistency Models. | 2014 |
16 | Harnessing Soft Computations for Low-Budget Fault Tolerance. | 2014 |
16 | Large pages and lightweight memory management in virtualized environments: can you have it both ways? | 2015 |
15 | Testudo: Heavyweight security analysis via statistical sampling. | 2008 |
15 | SHARK: Architectural support for autonomic protection against stealth by rootkit exploits. | 2008 |
15 | A systematic methodology to develop resilient cache coherence protocols. | 2011 |
15 | A data layout optimization framework for NUCA-based multicores. | 2011 |
15 | Inferred Models for Dynamic and Sparse Hardware-Software Spaces. | 2012 |
15 | Libra: Tailoring SIMD Execution Using Heterogeneous Hardware and Dynamic Configurability. | 2012 |
15 | BuMP: Bulk Memory Access Prediction and Streaming. | 2014 |
14 | CCICheck: using µhb graphs to verify the coherence-consistency interface. | 2015 |
14 | Reducing Instruction Fetch Cost by Packing Instructions into Register Windows. | 2005 |
14 | Optimal versus Heuristic Global Code Scheduling. | 2007 |
14 | InstantCheck: Checking the Determinism of Parallel Programs Using On-the-Fly Incremental Hashing. | 2010 |
14 | Virtual Snooping: Filtering Snoops in Virtualized Multi-cores. | 2010 |
14 | SLICC: Self-Assembly of Instruction Cache Collectives for OLTP Workloads. | 2012 |
14 | SHIFT: shared history instruction fetch for lean-core server processors. | 2013 |
14 | Voltage Noise in Multi-Core Processors: Empirical Characterization and Optimization Opportunities. | 2014 |
14 | PyMTL: A Unified Framework for Vertically Integrated Computer Architecture Research. | 2014 |
13 | Using a configurable processor generator for computer architecture prototyping. | 2009 |
13 | POWER7 multi-core processor design. | 2009 |
13 | Energy efficient GPU transactional memory via space-time optimizations. | 2013 |
13 | Imbalanced cache partitioning for balanced data-parallel programs. | 2013 |
13 | Citadel: Efficiently Protecting Stacked Memory from Large Granularity Failures. | 2014 |
13 | Multi-GPU System Design with Memory Networks. | 2014 |
13 | Arbitrary Modulus Indexing. | 2014 |
13 | Efficient persist barriers for multicores. | 2015 |
12 | A register-file approach for row buffer caches in die-stacked DRAMs. | 2011 |
12 | NoC Architectures for Silicon Interposer Systems: Why Pay for more Wires when you Can Get them (from your interposer) for Free? | 2014 |
12 | Exploring the Design Space of SPMD Divergence Management on Data-Parallel Architectures. | 2014 |
12 | Architectural Specialization for Inter-Iteration Loop Dependence Patterns. | 2014 |
12 | Free launch: optimizing GPU dynamic kernel launches through thread reuse. | 2015 |
12 | Enabling interposer-based disintegration of multi-core processors. | 2015 |
11 | Using Branch Correlation to Identify Infeasible Paths for Anomaly Detection. | 2006 |
11 | Memory Protection through Dynamic Access Control. | 2006 |
11 | Complementing user-level coarse-grain parallelism with implicit speculative parallelism. | 2011 |
11 | The Performance Vulnerability of Architectural and Non-architectural Arrays to Permanent Faults. | 2012 |
11 | DESC: energy-efficient data exchange using synchronized counters. | 2013 |
11 | Efficient multiprogramming for multicores with SCAF. | 2013 |
11 | B-Fetch: Branch Prediction Directed Prefetching for Chip-Multiprocessors. | 2014 |
11 | Cross-architecture performance prediction (XAPP) using CPU code to predict GPU performance. | 2015 |
11 | Efficiently prefetching complex address patterns. | 2015 |
10 | The Future Evolution of High-Performance Microprocessors. | 2005 |
10 | Efficient Use of Invisible Registers in Thumb Code. | 2005 |
10 | Tree register allocation. | 2009 |
10 | MLP-aware dynamic instruction window resizing for adaptively exploiting both ILP and MLP. | 2013 |
10 | Micro-Sliced Virtual Processors to Hide the Effect of Discontinuous CPU Availability for Consolidated Systems. | 2014 |
10 | Avoiding information leakage in the memory controller with fixed service policies. | 2015 |
10 | A scalable architecture for ordered parallelism. | 2015 |
10 | A cloud-scale acceleration architecture. | 2016 |
9 | A distributed processor state management architecture for large-window processors. | 2008 |
9 | ATDetector: improving the accuracy of a commercial data race detector by identifying address transfer. | 2011 |
9 | Predicting Coherence Communication by Tracking Synchronization Points at Run Time. | 2012 |
9 | Efficient management of last-level caches in graphics processors for 3D scene rendering workloads. | 2013 |
9 | Exploiting GPU peak-power and performance tradeoffs through reduced effective pipeline latency. | 2013 |
9 | Hi-Rise: A High-Radix Switch for 3D Integration with Single-Cycle Arbitration. | 2014 |
9 | RpStacks: Fast and Accurate Processor Design Space Exploration Using Representative Stall-Event Stacks. | 2014 |
9 | Exploiting commutativity to reduce the cost of updates to shared data in cache-coherent systems. | 2015 |
9 | Fast support for unstructured data processing: the unified automata processor. | 2015 |
9 | Enabling coordinated register allocation and thread-level parallelism optimization for GPUs. | 2015 |
8 | Wavelength stealing: an opportunistic approach to channel sharing in multi-chip photonic interconnects. | 2013 |
8 | Dodec: Random-Link, Low-Radix On-Chip Networks. | 2014 |
8 | Using ECC Feedback to Guide Voltage Speculation in Low-Voltage Processors. | 2014 |
8 | GPU register file virtualization. | 2015 |
8 | Neuromorphic accelerators: a comparison between neuroscience and machine-learning approaches. | 2015 |
8 | Coherence domain restriction on large scale systems. | 2015 |
8 | Efficient GPU synchronization without scopes: saying no to complex consistency models. | 2015 |
8 | Rubik: fast analytical power management for latency-critical systems. | 2015 |
8 | Delegated persist ordering. | 2016 |
7 | Control-Flow Decoupling. | 2012 |
7 | A Front-End Execution Architecture for High Energy Efficiency. | 2014 |
7 | Short-Circuiting Memory Traffic in Handheld Platforms. | 2014 |
7 | Execution Drafting: Energy Efficiency through Computation Deduplication. | 2014 |
7 | Improving DRAM latency with dynamic asymmetric subarray. | 2015 |
7 | The inner most loop iteration counter: a new dimension in branch history. | 2015 |
7 | TimeTrader: exploiting latency tail to save datacenter energy for online search. | 2015 |
7 | Fork path: improving efficiency of ORAM by removing redundant memory accesses. | 2015 |
7 | IMP: indirect memory prefetcher. | 2015 |
7 | Stripes: Bit-serial deep neural network computing. | 2016 |
6 | Incremental Commit Groups for Non-Atomic Trace Processing. | 2005 |
6 | Architecture-aware automatic computation offload for native applications. | 2015 |
6 | Border control: sandboxing accelerators. | 2015 |
6 | Microarchitectural implications of event-driven server-side web applications. | 2015 |
6 | Efficient warp execution in presence of divergence with collaborative context collection. | 2015 |
6 | Characterizing, modeling, and improving the QoE of mobile devices with low battery level. | 2015 |
5 | Data-Dependency Graph Transformations for Superblock Scheduling. | 2006 |
5 | TransCom: transforming stream communication for load balance and efficiency in networks-on-chip. | 2011 |
5 | Kernel Partitioning of Streaming Applications: A Statistical Approach to an NP-complete Problem. | 2012 |
5 | Compiler Support for Optimizing Memory Bank-Level Parallelism. | 2014 |
5 | Wormhole: Wisely Predicting Multidimensional Branches. | 2014 |
5 | Loop-Aware Memory Prefetching Using Code Block Working Sets. | 2014 |
5 | The CRISP performance model for dynamic voltage and frequency scaling in a GPGPU. | 2015 |
5 | An integrated concurrency and core-ISA architectural envelope definition, and test oracle, for IBM POWER multiprocessors. | 2015 |
5 | Prediction-guided performance-energy trade-off for interactive applications. | 2015 |
5 | Continuous runahead: Transparent hardware acceleration for memory intensive workloads. | 2016 |
5 | Approxilyzer: Towards a systematic framework for instruction-level approximate computing and its application to hardware resiliency. | 2016 |
4 | Why design must change: rethinking digital design. | 2009 |
4 | GPUMech: GPU Performance Modeling Technique Based on Interval Analysis. | 2014 |
4 | Safe limits on voltage reduction efficiency in GPUs: a direct measurement approach. | 2015 |
4 | A fast and accurate analytical technique to compute the AVF of sequential bits in a processor. | 2015 |
4 | Efficiently enforcing strong memory ordering in GPUs. | 2015 |
4 | Authenticache: harnessing cache ECC for system authentication. | 2015 |
4 | Execution time prediction for energy-efficient hardware accelerators. | 2015 |
4 | Low-cost soft error resilience with unified data verification and fine-grained recovery for acoustic sensor based detection. | 2016 |
4 | Graphicionado: A high-performance and energy-efficient accelerator for graph analytics. | 2016 |
4 | Co-designing accelerators and SoC interfaces using gem5-Aladdin. | 2016 |
4 | Improving bank-level parallelism for irregular applications. | 2016 |
4 | KLAP: Kernel launch aggregation and promotion for optimizing dynamic parallelism. | 2016 |
3 | SMARQ: Software-Managed Alias Register Queue for Dynamic Optimizations. | 2012 |
3 | Virtually-aged sampling DMR: unifying circuit failure prediction and circuit failure detection. | 2013 |
3 | DeSC: decoupled supply-compute communication management for heterogeneous architectures. | 2015 |
3 | HyComp: a hybrid cache compression method for selection of data-type-specific compression methods. | 2015 |
3 | Locking down insecure indirection with hardware-based control-data isolation. | 2015 |
3 | Modeling the implications of DRAM failures and protection techniques on datacenter TCO. | 2015 |
3 | More is less: improving the energy efficiency of data movement via opportunistic use of sparse codes. | 2015 |
3 | Fused-layer CNN accelerators. | 2016 |
3 | Towards efficient server architecture for virtualized network function deployment: Implications and implementations. | 2016 |
3 | Racer: TSO consistency via race detection. | 2016 |
2 | Architectures and algorithms for millisecond-scale molecular dynamics simulations of proteins. | 2008 |
2 | CRAM: coded registers for amplified multiporting. | 2011 |
2 | Allocating rotating registers by scheduling. | 2013 |
2 | Implicit-storing and redundant-encoding-of-attribute information in error-correction-codes. | 2013 |
2 | Specializing Compiler Optimizations through Programmable Composition for Dense Matrix Computations. | 2014 |
2 | Continuous, Low Overhead, Run-Time Validation of Program Executions. | 2014 |
2 | Bias-Free Branch Predictor. | 2014 |
2 | Bungee jumps: accelerating indirect branches through HW/SW co-design. | 2015 |
2 | Adaptive guardband scheduling to improve system-level efficiency of the POWER7+. | 2015 |
2 | MORC: a manycore-oriented compressed cache. | 2015 |
2 | CLEAN-ECC: high reliability ECC for adaptive granularity memory system. | 2015 |
2 | DynaMOS: dynamic schedule migration for heterogeneous cores. | 2015 |
2 | Self-contained, accurate precomputation prefetching. | 2015 |
2 | Confluence: unified instruction supply for scale-out servers. | 2015 |
2 | Filtered runahead execution with a runahead buffer. | 2015 |
2 | SABRes: Atomic object reads for in-memory rack-scale computing. | 2016 |
2 | Cambricon-X: An accelerator for sparse neural networks. | 2016 |
2 | Efficient kernel synthesis for performance portable programming. | 2016 |
2 | Chainsaw: Von-Neumann accelerators to leverage fused instruction chains. | 2016 |
2 | Bridging the I/O performance gap for big data workloads: A new NVDIMM-based approach. | 2016 |
2 | Spectral profiling: Observer-effect-free profiling by monitoring EM emanations. | 2016 |
2 | From high-level deep neural models to FPGAs. | 2016 |
1 | Microarchitecture in the system-level integration era. | 2008 |
1 | BulkCommit: scalable and fast commit of atomic blocks in a lazy multiprocessor environment. | 2013 |
1 | COMP: Compiler Optimizations for Manycore Processors. | 2014 |
1 | SAWS: synchronization aware GPGPU warp scheduling for multiple independent warp schedulers. | 2015 |
1 | vCache: architectural support for transparent and isolated virtual LLCs in virtualized environments. | 2015 |
1 | WarpPool: sharing requests with inter-warp coalescing for throughput processors. | 2015 |
1 | Enabling portable energy efficiency with memory accelerated library. | 2015 |
1 | DCS: a fast and scalable device-centric server architecture. | 2015 |
1 | Long term parking (LTP): criticality-aware resource allocation in OOO processors. | 2015 |
1 | A unified memory network architecture for in-memory computing in commodity servers. | 2016 |
1 | Path confidence based lookahead prefetching. | 2016 |
1 | Ti-states: Processor power management in the temperature inversion region. | 2016 |
1 | Cache-emulated register file: An integrated on-chip memory architecture for high performance GPGPUs. | 2016 |
1 | Chameleon: Versatile and practical near-DRAM acceleration architecture for large memory systems. | 2016 |
1 | An ultra low-power hardware accelerator for automatic speech recognition. | 2016 |
1 | HARE: Hardware accelerator for regular expressions. | 2016 |
1 | Evaluating programmable architectures for imaging and vision applications. | 2016 |
1 | Lazy release consistency for GPUs. | 2016 |
1 | Continuous shape shifting: Enabling loop co-optimization via near-free dynamic code rewriting. | 2016 |
1 | Quantifying and improving the efficiency of hardware-based mobile malware detectors. | 2016 |
1 | vDNN: Virtualized deep neural networks for scalable, memory-efficient neural network design. | 2016 |
1 | Perceptron learning for reuse prediction. | 2016 |
1 | NEUTRAMS: Neural network transformation and co-design under neuromorphic hardware constraints. | 2016 |
1 | C3D: Mitigating the NUMA bottleneck via coherent DRAM caches. | 2016 |
1 | OSCAR: Orchestrating STT-RAM cache traffic for heterogeneous CPU-GPU architectures. | 2016 |
0 | Message from the General Chairs. | 2005 |
0 | Message from the Program Co-Chairs. | 2005 |
0 | Control flow coalescing on a hybrid dataflow/von Neumann GPGPU. | 2015 |
0 | Ultra-low power render-based collision detection for CPU/GPU systems. | 2015 |
0 | Snatch: Opportunistically reassigning power allocation between processor and memory in 3D stacks. | 2016 |
0 | pTask: A smart prefetching scheme for OS intensive applications. | 2016 |
0 | MIMD synchronization on SIMT architectures. | 2016 |
0 | Redefining QoS and customizing the power management policy to satisfy individual mobile users. | 2016 |
0 | Contention-based congestion management in large-scale networks. | 2016 |
0 | PoisonIvy: Safe speculation for secure memory. | 2016 |
0 | The Bunker Cache for spatio-value approximation. | 2016 |
0 | Register sharing for equality prediction. | 2016 |
0 | CrystalBall: Statically analyzing runtime behavior via deep sequence learning. | 2016 |
0 | ReplayConfusion: Detecting cache-based covert channel attacks using record and replay. | 2016 |
0 | Dynamic error mitigation in NoCs using intelligent prediction techniques. | 2016 |
0 | Zorua: A holistic approach to resource virtualization in GPUs. | 2016 |
0 | A patch memory system for image processing and computer vision. | 2016 |
0 | Improving energy efficiency of DRAM by exploiting half page row access. | 2016 |
0 | Efficient data supply for hardware accelerators with prefetching and access/execute decoupling. | 2016 |
0 | The microarchitecture of a real-time robot motion planning accelerator. | 2016 |
0 | CANDY: Enabling coherent DRAM caches for multi-node systems. | 2016 |
0 | GRAPE: Minimizing energy for GPU applications with performance requirements. | 2016 |
0 | Exploiting semantic commutativity in hardware speculation. | 2016 |
0 | Dictionary sharing: An efficient cache compression scheme for compressed caches. | 2016 |
0 | Data-centric execution of speculative parallel programs. | 2016 |
0 | NeSC: Self-virtualizing nested storage controller. | 2016 |
0 | Reducing data movement energy via online data clustering and encoding. | 2016 |
0 | Keynotes: Internet of Things: History and hype, technology and policy. | 2016 |
0 | Concise loads and stores: The case for an asymmetric compute-memory architecture for approximation. | 2016 |
2016
Cited by | Paper title |
---|---|
18 | Jump over ASLR: Attacking branch predictors to bypass ASLR. |
10 | A cloud-scale acceleration architecture. |
8 | Delegated persist ordering. |
7 | Stripes: Bit-serial deep neural network computing. |
5 | Continuous runahead: Transparent hardware acceleration for memory intensive workloads. |
5 | Approxilyzer: Towards a systematic framework for instruction-level approximate computing and its application to hardware resiliency. |
4 | Low-cost soft error resilience with unified data verification and fine-grained recovery for acoustic sensor based detection. |
4 | Graphicionado: A high-performance and energy-efficient accelerator for graph analytics. |
4 | Co-designing accelerators and SoC interfaces using gem5-Aladdin. |
4 | Improving bank-level parallelism for irregular applications. |
4 | KLAP: Kernel launch aggregation and promotion for optimizing dynamic parallelism. |
3 | Fused-layer CNN accelerators. |
3 | Towards efficient server architecture for virtualized network function deployment: Implications and implementations. |
3 | Racer: TSO consistency via race detection. |
2 | SABRes: Atomic object reads for in-memory rack-scale computing. |
2 | Cambricon-X: An accelerator for sparse neural networks. |
2 | Efficient kernel synthesis for performance portable programming. |
2 | Chainsaw: Von-Neumann accelerators to leverage fused instruction chains. |
2 | Bridging the I/O performance gap for big data workloads: A new NVDIMM-based approach. |
2 | Spectral profiling: Observer-effect-free profiling by monitoring EM emanations. |
2 | From high-level deep neural models to FPGAs. |
1 | A unified memory network architecture for in-memory computing in commodity servers. |
1 | Path confidence based lookahead prefetching. |
1 | Ti-states: Processor power management in the temperature inversion region. |
1 | Cache-emulated register file: An integrated on-chip memory architecture for high performance GPGPUs. |
1 | Chameleon: Versatile and practical near-DRAM acceleration architecture for large memory systems. |
1 | An ultra low-power hardware accelerator for automatic speech recognition. |
1 | HARE: Hardware accelerator for regular expressions. |
1 | Evaluating programmable architectures for imaging and vision applications. |
1 | Lazy release consistency for GPUs. |
1 | Continuous shape shifting: Enabling loop co-optimization via near-free dynamic code rewriting. |
1 | Quantifying and improving the efficiency of hardware-based mobile malware detectors. |
1 | vDNN: Virtualized deep neural networks for scalable, memory-efficient neural network design. |
1 | Perceptron learning for reuse prediction. |
1 | NEUTRAMS: Neural network transformation and co-design under neuromorphic hardware constraints. |
1 | C3D: Mitigating the NUMA bottleneck via coherent DRAM caches. |
1 | OSCAR: Orchestrating STT-RAM cache traffic for heterogeneous CPU-GPU architectures. |
0 | Snatch: Opportunistically reassigning power allocation between processor and memory in 3D stacks. |
0 | pTask: A smart prefetching scheme for OS intensive applications. |
0 | MIMD synchronization on SIMT architectures. |
0 | Redefining QoS and customizing the power management policy to satisfy individual mobile users. |
0 | Contention-based congestion management in large-scale networks. |
0 | PoisonIvy: Safe speculation for secure memory. |
0 | The Bunker Cache for spatio-value approximation. |
0 | Register sharing for equality prediction. |
0 | CrystalBall: Statically analyzing runtime behavior via deep sequence learning. |
0 | ReplayConfusion: Detecting cache-based covert channel attacks using record and replay. |
0 | Dynamic error mitigation in NoCs using intelligent prediction techniques. |
0 | Zorua: A holistic approach to resource virtualization in GPUs. |
0 | A patch memory system for image processing and computer vision. |
0 | Improving energy efficiency of DRAM by exploiting half page row access. |
0 | Efficient data supply for hardware accelerators with prefetching and access/execute decoupling. |
0 | The microarchitecture of a real-time robot motion planning accelerator. |
0 | CANDY: Enabling coherent DRAM caches for multi-node systems. |
0 | GRAPE: Minimizing energy for GPU applications with performance requirements. |
0 | Exploiting semantic commutativity in hardware speculation. |
0 | Dictionary sharing: An efficient cache compression scheme for compressed caches. |
0 | Data-centric execution of speculative parallel programs. |
0 | NeSC: Self-virtualizing nested storage controller. |
0 | Reducing data movement energy via online data clustering and encoding. |
0 | Keynotes: Internet of Things: History and hype, technology and policy. |
0 | Concise loads and stores: The case for an asymmetric compute-memory architecture for approximation. |
2015
Cited by | Paper title |
---|---|
36 | The application slowdown model: quantifying and controlling the impact of inter-application interference at shared caches and main memory. |
26 | Gather-scatter DRAM: in-DRAM address translation to improve the spatial locality of non-unit strided accesses. |
20 | ThyNVM: enabling software-transparent crash consistency in persistent memory systems. |
19 | Doppelgänger: a cache for approximate computing. |
18 | Neural acceleration for GPU throughput processors. |
16 | Large pages and lightweight memory management in virtualized environments: can you have it both ways? |
14 | CCICheck: using µhb graphs to verify the coherence-consistency interface. |
13 | Efficient persist barriers for multicores. |
12 | Free launch: optimizing GPU dynamic kernel launches through thread reuse. |
12 | Enabling interposer-based disintegration of multi-core processors. |
11 | Cross-architecture performance prediction (XAPP) using CPU code to predict GPU performance. |
11 | Efficiently prefetching complex address patterns. |
10 | Avoiding information leakage in the memory controller with fixed service policies. |
10 | A scalable architecture for ordered parallelism. |
9 | Exploiting commutativity to reduce the cost of updates to shared data in cache-coherent systems. |
9 | Fast support for unstructured data processing: the unified automata processor. |
9 | Enabling coordinated register allocation and thread-level parallelism optimization for GPUs. |
8 | GPU register file virtualization. |
8 | Neuromorphic accelerators: a comparison between neuroscience and machine-learning approaches. |
8 | Coherence domain restriction on large scale systems. |
8 | Efficient GPU synchronization without scopes: saying no to complex consistency models. |
8 | Rubik: fast analytical power management for latency-critical systems. |
7 | Improving DRAM latency with dynamic asymmetric subarray. |
7 | The inner most loop iteration counter: a new dimension in branch history. |
7 | TimeTrader: exploiting latency tail to save datacenter energy for online search. |
7 | Fork path: improving efficiency of ORAM by removing redundant memory accesses. |
7 | IMP: indirect memory prefetcher. |
6 | Architecture-aware automatic computation offload for native applications. |
6 | Border control: sandboxing accelerators. |
6 | Microarchitectural implications of event-driven server-side web applications. |
6 | Efficient warp execution in presence of divergence with collaborative context collection. |
6 | Characterizing, modeling, and improving the QoE of mobile devices with low battery level. |
5 | The CRISP performance model for dynamic voltage and frequency scaling in a GPGPU. |
5 | An integrated concurrency and core-ISA architectural envelope definition, and test oracle, for IBM POWER multiprocessors. |
5 | Prediction-guided performance-energy trade-off for interactive applications. |
4 | Safe limits on voltage reduction efficiency in GPUs: a direct measurement approach. |
4 | A fast and accurate analytical technique to compute the AVF of sequential bits in a processor. |
4 | Efficiently enforcing strong memory ordering in GPUs. |
4 | Authenticache: harnessing cache ECC for system authentication. |
4 | Execution time prediction for energy-efficient hardware accelerators. |
3 | DeSC: decoupled supply-compute communication management for heterogeneous architectures. |
3 | HyComp: a hybrid cache compression method for selection of data-type-specific compression methods. |
3 | Locking down insecure indirection with hardware-based control-data isolation. |
3 | Modeling the implications of DRAM failures and protection techniques on datacenter TCO. |
3 | More is less: improving the energy efficiency of data movement via opportunistic use of sparse codes. |
2 | Bungee jumps: accelerating indirect branches through HW/SW co-design. |
2 | Adaptive guardband scheduling to improve system-level efficiency of the POWER7+. |
2 | MORC: a manycore-oriented compressed cache. |
2 | CLEAN-ECC: high reliability ECC for adaptive granularity memory system. |
2 | DynaMOS: dynamic schedule migration for heterogeneous cores. |
2 | Self-contained, accurate precomputation prefetching. |
2 | Confluence: unified instruction supply for scale-out servers. |
2 | Filtered runahead execution with a runahead buffer. |
1 | SAWS: synchronization aware GPGPU warp scheduling for multiple independent warp schedulers. |
1 | vCache: architectural support for transparent and isolated virtual LLCs in virtualized environments. |
1 | WarpPool: sharing requests with inter-warp coalescing for throughput processors. |
1 | Enabling portable energy efficiency with memory accelerated library. |
1 | DCS: a fast and scalable device-centric server architecture. |
1 | Long term parking (LTP): criticality-aware resource allocation in OOO processors. |
0 | Control flow coalescing on a hybrid dataflow/von Neumann GPGPU. |
0 | Ultra-low power render-based collision detection for CPU/GPU systems. |
2014
Cited by | Paper title |
---|---|
169 | DaDianNao: A Machine-Learning Supercomputer. |
51 | Adaptive Cache Management for Energy-Efficient GPU Computing. |
49 | Unison Cache: A Scalable and Effective Die-Stacked DRAM Cache. |
44 | SMiTe: Precise QoS Prediction on Real-System SMT Processors to Improve Utilization in Warehouse Scale Computers. |
42 | FIRM: Fair and High-Performance Memory Control for Persistent Memory Systems. |
40 | Managing GPU Concurrency in Heterogeneous Architectures. |
40 | Load Value Approximation. |
37 | Iso-X: A Flexible Architecture for Hardware-Managed Isolated Execution. |
37 | CAMEO: A Two-Level Memory Organization with Capacity of Main Memory and Flexibility of Hardware-Managed Cache. |
33 | PPEP: Online Performance, Power, and Energy Prediction Framework and DVFS Space Exploration. |
31 | A Practical Methodology for Measuring the Side-Channel Signal Available to the Attacker for Instruction-Level Events. |
31 | Random Fill Cache Architecture. |
30 | Transparent Hardware Management of Stacked DRAM as Part of Memory. |
30 | PORPLE: An Extensible Optimizer for Portable Data Placement on GPU. |
29 | Efficient Memory Virtualization: Reducing Dimensionality of Nested Page Walks. |
27 | Protean Code: Achieving Near-Free Online Code Transformations for Warehouse Scale Computers. |
26 | Locality-Aware Mapping of Nested Parallel Patterns on GPUs. |
26 | CC-Hunter: Uncovering Covert Timing Channels on Shared Processor Hardware. |
25 | Bi-Modal DRAM Cache: Improving Hit Rate, Hit Latency and Bandwidth. |
20 | Skewed Compressed Caches. |
19 | Calculating Architectural Vulnerability Factors for Spatial Multi-Bit Transient Faults. |
19 | Enabling Realistic Fine-Grain Voltage Scaling with Reconfigurable Power Distribution Networks. |
19 | Futility Scaling: High-Associativity Cache Partitioning. |
19 | Equalizer: Dynamic Tuning of GPU Resources for Efficient Execution. |
18 | Accelerating Irregular Algorithms on GPGPUs Using Fine-Grain Hardware Worklists. |
16 | PipeCheck: Specifying and Verifying Microarchitectural Enforcement of Memory Consistency Models. |
16 | Harnessing Soft Computations for Low-Budget Fault Tolerance. |
15 | BuMP: Bulk Memory Access Prediction and Streaming. |
14 | Voltage Noise in Multi-Core Processors: Empirical Characterization and Optimization Opportunities. |
14 | PyMTL: A Unified Framework for Vertically Integrated Computer Architecture Research. |
13 | Citadel: Efficiently Protecting Stacked Memory from Large Granularity Failures. |
13 | Multi-GPU System Design with Memory Networks. |
13 | Arbitrary Modulus Indexing. |
12 | NoC Architectures for Silicon Interposer Systems: Why Pay for more Wires when you Can Get them (from your interposer) for Free? |
12 | Exploring the Design Space of SPMD Divergence Management on Data-Parallel Architectures. |
12 | Architectural Specialization for Inter-Iteration Loop Dependence Patterns. |
11 | B-Fetch: Branch Prediction Directed Prefetching for Chip-Multiprocessors. |
10 | Micro-Sliced Virtual Processors to Hide the Effect of Discontinuous CPU Availability for Consolidated Systems. |
9 | Hi-Rise: A High-Radix Switch for 3D Integration with Single-Cycle Arbitration. |
9 | RpStacks: Fast and Accurate Processor Design Space Exploration Using Representative Stall-Event Stacks. |
8 | Dodec: Random-Link, Low-Radix On-Chip Networks. |
8 | Using ECC Feedback to Guide Voltage Speculation in Low-Voltage Processors. |
7 | A Front-End Execution Architecture for High Energy Efficiency. |
7 | Short-Circuiting Memory Traffic in Handheld Platforms. |
7 | Execution Drafting: Energy Efficiency through Computation Deduplication. |
5 | Compiler Support for Optimizing Memory Bank-Level Parallelism. |
5 | Wormhole: Wisely Predicting Multidimensional Branches. |
5 | Loop-Aware Memory Prefetching Using Code Block Working Sets. |
4 | GPUMech: GPU Performance Modeling Technique Based on Interval Analysis. |
2 | Specializing Compiler Optimizations through Programmable Composition for Dense Matrix Computations. |
2 | Continuous, Low Overhead, Run-Time Validation of Program Executions. |
2 | Bias-Free Branch Predictor. |
1 | COMP: Compiler Optimizations for Manycore Processors. |
2013
Cited by | Paper title |
---|---|
136 | Approximate storage in solid-state memories. |
111 | SAGE: self-tuning approximation for graphics engines. |
107 | Quality programmable vector processors for approximate computing. |
80 | Meet the walkers: accelerating index traversals for in-memory databases. |
78 | Kiln: closing the performance gap between systems with and without persistence support. |
69 | Divergence-aware warp scheduling. |
68 | RowClone: fast and energy-efficient in-DRAM bulk data copy and initialization. |
59 | Heterogeneous system coherence for integrated CPU-GPU systems. |
53 | A locality-aware memory hierarchy for energy-efficient GPU architectures. |
47 | Linearly compressed pages: a low-complexity, low-latency main memory compression framework. |
45 | Decoupled compressed cache: exploiting spatial locality for energy-optimized compressed caching. |
38 | Linearizing irregular memory accesses for improved correlated prefetching. |
37 | Large-reach memory management unit caches. |
37 | Multi-grain coherence directories. |
32 | Warped gates: gating aware scheduling and power gating for GPGPUs. |
30 | Quantifying the relationship between the power delivery network and architectural policies in a 3D-stacked memory device. |
26 | Insertion and promotion for tree-based PseudoLRU last-level caches. |
26 | Trace based phase prediction for tightly-coupled heterogeneous cores. |
25 | The reuse cache: downsizing the shared last-level cache. |
24 | Enabling datacenter servers to scale out economically and sustainably. |
24 | uDIREC: unified diagnosis and reconfiguration for frugal bypass of NoC faults. |
24 | TLC: a tag-less cache for reducing dynamic first level cache energy. |
20 | RDIP: return-address-stack directed instruction prefetching. |
20 | Crank it up or dial it down: coordinated multiprocessor frequency and folding control. |
19 | Aegis: partitioning data block for efficient recovery of stuck-at-faults in phase change memory. |
19 | Use it or lose it: wear-out and lifetime in future chip multiprocessors. |
14 | SHIFT: shared history instruction fetch for lean-core server processors. |
13 | Energy efficient GPU transactional memory via space-time optimizations. |
13 | Imbalanced cache partitioning for balanced data-parallel programs. |
11 | DESC: energy-efficient data exchange using synchronized counters. |
11 | Efficient multiprogramming for multicores with SCAF. |
10 | MLP-aware dynamic instruction window resizing for adaptively exploiting both ILP and MLP. |
9 | Efficient management of last-level caches in graphics processors for 3D scene rendering workloads. |
9 | Exploiting GPU peak-power and performance tradeoffs through reduced effective pipeline latency. |
8 | Wavelength stealing: an opportunistic approach to channel sharing in multi-chip photonic interconnects. |
3 | Virtually-aged sampling DMR: unifying circuit failure prediction and circuit failure detection. |
2 | Allocating rotating registers by scheduling. |
2 | Implicit-storing and redundant-encoding-of-attribute information in error-correction-codes. |
1 | BulkCommit: scalable and fast commit of atomic blocks in a lazy multiprocessor environment. |
2012
Cited by | Paper title |
---|---|
321 | Neural Acceleration for General-Purpose Approximate Programs. |
206 | Cache-Conscious Wavefront Scheduling. |
125 | Transactional Memory Architecture and Implementation for IBM System Z. |
123 | Fundamental Latency Trade-off in Architecting DRAM Caches: Outperforming Impractical SRAM-Tags with a Simple and Practical Design. |
111 | CoScale: Coordinating CPU and Memory System DVFS in Server Systems. |
99 | Composite Cores: Pushing Heterogeneity Into a Core. |
73 | Improving Cache Management Policies Using Dynamic Reuse Distances. |
72 | KnightShift: Scaling the Energy Proportionality Wall through Server-Level Heterogeneity. |
70 | NoRD: Node-Router Decoupling for Effective Power-gating of On-Chip Routers. |
70 | Unifying Primary Cache, Scratch, and Register File Memories in a Throughput Processor. |
66 | Kernel Weaver: Automatically Fusing Database Primitives for Efficient GPU Computation. |
60 | Predicting Performance Impact of DVFS for Realistic Memory Systems. |
59 | MorphCore: An Energy-Efficient Microarchitecture for High Performance ILP and High Throughput TLP. |
54 | NoCAlert: An On-Line and Real-Time Fault Detection Mechanism for Network-on-Chip Architectures. |
54 | CoLT: Coalesced Large-Reach TLBs. |
52 | Spatiotemporal Coherence Tracking. |
48 | A Mostly-Clean DRAM Cache for Effective Hit Speculation and Self-Balancing Dispatch. |
45 | FPB: Fine-grained Power Budgeting to Improve Write Throughput of Multi-level Cell Phase Change Memory. |
44 | Amoeba-Cache: Adaptive Blocks for Eliminating Waste in the Memory Hierarchy. |
42 | Rethinking DRAM Power Modes for Energy Proportionality. |
39 | NOC-Out: Microarchitecting a Scale-Out Processor. |
32 | Accurate Fine-Grained Processor Power Proxies. |
31 | Profiling Data-Dependence to Assist Parallelization: Framework, Scope, and Optimization. |
30 | Leveraging Heterogeneity in DRAM Main Memories to Accelerate Critical Word Access. |
29 | Vulcan: Hardware Support for Detecting Sequential Consistency Violations Dynamically. |
27 | Designing a Programmable Wire-Speed Regular-Expression Matching Accelerator. |
27 | Warped-DMR: Light-weight Error Detection for GPGPU. |
25 | Dynamic Reconfiguration of 3D Photonic Networks-on-Chip for Maximizing Performance and Improving Fault Tolerance. |
25 | AUDIT: Stress Testing the Automatic Way. |
21 | Addressing End-to-End Memory Access Latency in NoC-Based Multicores. |
19 | Vector Extensions for Decision Support DBMS Acceleration. |
19 | Systematic Energy Characterization of CMP/SMT Processor Systems via Automated Micro-Benchmarks. |
15 | Inferred Models for Dynamic and Sparse Hardware-Software Spaces. |
15 | Libra: Tailoring SIMD Execution Using Heterogeneous Hardware and Dynamic Configurability. |
14 | SLICC: Self-Assembly of Instruction Cache Collectives for OLTP Workloads. |
11 | The Performance Vulnerability of Architectural and Non-architectural Arrays to Permanent Faults. |
9 | Predicting Coherence Communication by Tracking Synchronization Points at Run Time. |
7 | Control-Flow Decoupling. |
5 | Kernel Partitioning of Streaming Applications: A Statistical Approach to an NP-complete Problem. |
3 | SMARQ: Software-Managed Alias Register Queue for Dynamic Optimizations. |
2011
Cited by | Paper title |
---|---|
285 | Bubble-Up: increasing utilization in modern warehouse scale computers via sensible co-locations. |
253 | Improving GPU performance via large warps and two-level warp scheduling. |
170 | Multi retention level STT-RAM cache designs with a dynamic refresh scheme. |
166 | Reducing memory interference in multicore systems via application-aware memory channel partitioning. |
146 | Pack&Cap: adaptive DVFS and thread packing under power caps. |
129 | SHiP: signature-based hit predictor for high performance caching. |
123 | Efficiently enabling conventional block sizes for very large die-stacked DRAM caches. |
112 | QsCores: trading dark silicon for scalable energy efficiency with quasi-specific cores. |
104 | Parallel application memory scheduling. |
96 | Active management of timing guardband to save energy in POWER7. |
87 | Minimalist open-page: a DRAM page-mode scheduling policy for the many-core era. |
86 | Bundled execution of recurring traces for energy-efficient general purpose processing. |
78 | PACMan: prefetch-aware cache management for high performance caching. |
75 | Pay-As-You-Go: low-overhead hard-error correction for phase change memories. |
67 | SIMD re-convergence at thread frontiers. |
61 | Preventing PCM banks from seizing too much power. |
60 | Hardware transactional memory for GPU architectures. |
58 | Towards the ideal on-chip fabric for 1-to-many and many-to-1 communication. |
58 | Architectural support for secure virtualization under a vulnerable hypervisor. |
47 | A compile-time managed multi-level register file hierarchy. |
45 | A new case for the TAGE branch predictor. |
42 | A resistive TCAM accelerator for data-intensive computing. |
40 | Dataflow execution of sequential imperative programs on multicore architectures. |
40 | Proactive instruction fetch. |
34 | Encore: low-cost, fine-grained transient fault recovery. |
32 | Packet chaining: efficient single-cycle allocation for on-chip networks. |
31 | System-level integrated server architectures for scale-out datacenters. |
27 | Accelerating microprocessor silicon validation by exposing ISA diversity. |
27 | CoreRacer: a practical memory race recorder for multicore x86 TSO processors. |
27 | Formally enhanced runtime verification to ensure NoC functional correctness. |
27 | Residue cache: a low-energy low-area L2 cache architecture via compression and partial hits. |
24 | FeatherWeight: low-cost optical arbitration with QoS support. |
22 | Idempotent processor architecture. |
22 | Identifying and predicting timing-critical instructions to boost timing speculation. |
19 | The NoX router. |
18 | Resilient microring resonator based photonic networks. |
17 | Manager-client pairing: a framework for implementing coherence hierarchies. |
15 | A systematic methodology to develop resilient cache coherence protocols. |
15 | A data layout optimization framework for NUCA-based multicores. |
12 | A register-file approach for row buffer caches in die-stacked DRAMs. |
11 | Complementing user-level coarse-grain parallelism with implicit speculative parallelism. |
9 | ATDetector: improving the accuracy of a commercial data race detector by identifying address transfer. |
5 | TransCom: transforming stream communication for load balance and efficiency in networks-on-chip. |
2 | CRAM: coded registers for amplified multiporting. |
2010
Cited by | Paper title |
---|---|
294 | Thread Cluster Memory Scheduling: Exploiting Differences in Memory Access Behavior. |
210 | Single-Chip Heterogeneous Computing: Does the Future Include Custom Logic, FPGAs, and GPGPUs? |
210 | Moneta: A High-Performance Storage Array Architecture for Next-Generation, Non-volatile Memories. |
146 | The ZCache: Decoupling Ways and Associativity. |
143 | Understanding the Energy Consumption of Dynamic Random Access Memories. |
135 | SAFER: Stuck-At-Fault Error Recovery for Memories. |
106 | Sampling Dead Block Prediction for Last-Level Caches. |
96 | Task Superscalar: An Out-of-Order Task Pipeline. |
94 | Many-Thread Aware Prefetching Mechanisms for GPGPU Applications. |
90 | Achieving Non-Inclusive Cache Performance with Inclusive Caches: Temporal Locality Aware (TLA) Cache Management Policies. |
88 | SD3: A Scalable Approach to Dynamic Data-Dependence Profiling. |
83 | Elastic Refresh: Techniques to Mitigate Refresh Penalties in High Density Memory. |
82 | Throughput-Effective On-Chip Networks for Manycore Accelerators. |
67 | ASF: AMD64 Extension for Lock-Free Data Structures and Transactional Memory. |
63 | A Dynamically Adaptable Hardware Transactional Memory. |
52 | ReMAP: A Reconfigurable Heterogeneous Multicore Architecture. |
49 | Combating Aging with the Colt Duty Cycle Equalizer. |
49 | A Predictive Model for Dynamic Microarchitectural Adaptivity Control. |
48 | Voltage Smoothing: Characterizing and Mitigating Voltage Noise in Production Processors via Software-Guided Thread Scheduling. |
47 | Flexible and Efficient Instruction-Grained Run-Time Monitoring Using On-Chip Reconfigurable Fabric. |
45 | Memory Latency Reduction via Thread Throttling. |
45 | Fractal Coherence: Scalably Verifiable Cache Coherence. |
42 | Efficient Selection of Vector Instructions Using Dynamic Programming. |
40 | Probabilistic Distance-Based Arbitration: Providing Equality of Service for Many-Core CMPs. |
40 | AtomTracker: A Comprehensive Approach to Atomic Region Inference and Violation Detection. |
38 | Adaptive Flow Control for Robust Performance and Energy. |
38 | Register Cache System Not for Latency Reduction Purpose. |
37 | Automatic Parallelization in a Binary Rewriter. |
37 | Parichute: Generalized Turbocode-Based Error Correction for Near-Threshold Caches. |
32 | Synergistic TLBs for High Performance Address Translation in Chip Multiprocessors. |
31 | LOFT: A High Performance Network-on-Chip Providing Quality-of-Service Support. |
29 | Scalable Speculative Parallelization on Commodity Clusters. |
27 | Tolerating Concurrency Bugs Using Transactions as Lifeguards. |
27 | Pseudo-Circuit: Accelerating Communication for On-Chip Interconnection Networks. |
26 | STEM: Spatiotemporal Management of Capacity for Intra-core Last Level Caches. |
24 | AVF Stressmark: Towards an Automated Methodology for Bounding the Worst-Case Vulnerability to Soft Errors. |
24 | Minimal Multi-threading: Finding and Removing Redundant Instructions in Multi-threaded Processors. |
23 | ScalableBulk: Scalable Cache Coherence for Atomic Blocks in a Lazy Environment. |
22 | Adaptive and Speculative Slack Simulations of CMPs on CMPs. |
22 | Hardware Support for Relaxed Concurrency Control in Transactional Memory. |
20 | Erasing Core Boundaries for Robust and Configurable Performance. |
19 | Architectural Support for Fair Reader-Writer Locking. |
18 | Improving SIMT Efficiency of Global Rendering Algorithms with Architectural Support for Dynamic Micro-Kernels. |
14 | InstantCheck: Checking the Determinism of Parallel Programs Using On-the-Fly Incremental Hashing. |
14 | Virtual Snooping: Filtering Snoops in Virtualized Multi-cores. |
2009
Cited by | Paper title |
---|---|
1501 | McPAT: an integrated power, area, and timing modeling framework for multicore and manycore architectures. |
470 | Qilin: exploiting parallelism on heterogeneous multiprocessors with adaptive mapping. |
465 | Enhancing lifetime and security of PCM-based main memory with start-gap wear leveling. |
335 | Characterizing flash memory: anomalies, observations, and applications. |
334 | Flip-N-Write: a simple deterministic technique to improve PRAM write performance, energy and endurance. |
296 | Into the wild: studying real user activity patterns to guide power optimizations for mobile architectures. |
165 | Application-aware prioritization mechanisms for on-chip networks. |
148 | Low-cost router microarchitecture for on-chip networks. |
148 | SCARAB: a single cycle adaptive routing and bufferless network. |
139 | A tagless coherence directory. |
134 | Improving cache lifetime reliability at ultra-low voltages. |
129 | Coordinated control of multiple prefetchers in multi-core systems. |
120 | Pseudo-LIFO: the foundation of a new family of replacement policies for last-level caches. |
118 | EazyHTM: eager-lazy hardware transactional memory. |
118 | Characterizing and mitigating the impact of process variations on phase change based memory systems. |
115 | Comparing cache architectures and coherency protocols on x86-64 multicore SMP systems. |
113 | Preemptive virtual clock: a flexible, efficient, and cost-effective QOS scheme for networks-on-chip. |
105 | Light speed arbitration and flow control for nanophotonic interconnects. |
99 | Complexity effective memory access scheduling for many-core accelerator architectures. |
94 | A case for dynamic frequency tuning in on-chip networks. |
89 | Polymorphic pipeline array: a flexible multicore accelerator with virtualized execution for mobile multimedia applications. |
84 | The BubbleWrap many-core: popping cores for sequential acceleration. |
81 | Finding concurrency bugs with context-aware communication graphs. |
78 | mSWAT: low-cost hardware fault detection and diagnosis for multicore systems. |
77 | Low Vccmin fault-tolerant cache with highly predictable performance. |
77 | ZerehCache: armoring cache architectures in high defect density technologies. |
74 | Extending the effectiveness of 3D-stacked DRAM caches with an adaptive multi-queue policy. |
73 | Adaptive line placement with the set balancing cache. |
69 | Improving memory bank-level parallelism in the presence of prefetching. |
63 | Tribeca: design for PVT variations with local recovery and fine-grained adaptation. |
60 | ESKIMO: Energy savings using Semantic Knowledge of Inconsequential Memory Occupancy for DRAM subsystem. |
56 | SHARP control: controlled shared cache management in chip multiprocessors. |
55 | Proactive transaction scheduling for contention management. |
54 | Portable compiler optimisation across embedded programs and microarchitectures using machine learning. |
49 | Execution leases: a hardware-supported mechanism for enforcing strong non-interference. |
49 | In-network coherence filtering: snoopy coherence without broadcasts. |
49 | BulkCompiler: high-performance sequential consistency through cooperative compiler and hardware support. |
43 | Reducing peak power with a table-driven adaptive processor core. |
42 | Offline symbolic analysis for multi-processor execution replay. |
41 | An hybrid eDRAM/SRAM macrocell to implement first-level data caches. |
40 | Optimizing shared cache behavior of chip multiprocessors. |
40 | Multiple clock and voltage domains for chip multi processors. |
36 | Architecting a chunk-based memory race recorder in modern CMPs. |
33 | DDT: design and evaluation of a dynamic program analysis for optimizing data structure usage. |
32 | Characterizing the resource-sharing levels in the UltraSPARC T2 processor. |
28 | Light64: lightweight hardware support for data race detection during systematic testing of parallel programs. |
26 | Control flow obfuscation with information flow tracking. |
26 | Ordering decoupled metadata accesses in multiprocessors. |
25 | Variation-tolerant non-uniform 3D cache management in die stacked multicore processor. |
21 | A microarchitecture-based framework for pre- and post-silicon power delivery analysis. |
13 | Using a configurable processor generator for computer architecture prototyping. |
13 | POWER7 multi-core processor design. |
10 | Tree register allocation. |
4 | Why design must change: rethinking digital design. |
2008
Cited by | Paper title |
---|---|
211 | Coordinated management of multiple interacting resources in chip multiprocessors: A machine learning approach. |
203 | Mini-rank: Adaptive DRAM architecture for improving memory power efficiency. |
185 | Facelift: Hiding and slowing down aging in multicores. |
153 | Prefetch-Aware DRAM Controllers. |
147 | Cache bursts: A new approach for eliminating dead blocks and increasing cache efficiency. |
137 | Copy or Discard execution model for speculative parallelization on multicores. |
129 | A novel cache architecture with enhanced performance and security. |
113 | Reducing the harmful effects of last-level cache polluters with an OS-level, software-only pollute buffer. |
113 | Token flow control. |
104 | Dependence-aware transactional memory for increased concurrency. |
96 | From SODA to scotch: The evolution of a wireless baseband processor. |
91 | EVAL: Utilizing processors with variation-induced timing errors. |
90 | Virtual tree coherence: Leveraging regions and in-network multicast trees for scalable cache coherence. |
85 | Efficient unicast and multicast support for CMPs. |
80 | The StageNet fabric for constructing resilient multicore systems. |
66 | Power reduction of CMP communication networks via RF-interconnects. |
60 | CPR: Composable performance regression for scalable multiprocessor models. |
60 | Notary: Hardware techniques to enhance signatures. |
59 | Power to the people: Leveraging human physiological traits to control microprocessor frequency. |
52 | Online design bug detection: RTL analysis, flexible mechanisms, and evaluation. |
51 | Reconfigurable energy efficient near threshold cache architectures. |
49 | Tradeoffs in designing accelerator architectures for visual computing. |
49 | Toward a multicore architecture for real-time ray-tracing. |
47 | NBTI tolerant microarchitecture design in the presence of process variation. |
45 | Token tenure: PATCHing token counting using directory-based cache coherence. |
44 | Temporal instruction fetch streaming. |
41 | Adaptive data compression for high-performance low-power on-chip networks. |
41 | Low-power, high-performance analog neural branch prediction. |
34 | Hybrid analytical modeling of pending cache hits, data prefetching, and MSHRs. |
29 | Microarchitecture soft error vulnerability characterization and mitigation under 3D integration technology. |
28 | A small cache of large ranges: Hardware methods for efficiently searching, storing, and updating big dataflow tags. |
28 | A performance-correctness explicitly-decoupled architecture. |
26 | Shapeshifter: Dynamically changing pipeline width and speed to address process variations. |
19 | Verification of chip multiprocessor memory systems using a relaxed scoreboard. |
19 | Implementing high availability memory with a duplication cache. |
19 | Evaluating the effects of cache redundancy on profit. |
18 | Strategies for mapping dataflow blocks to distributed hardware. |
15 | Testudo: Heavyweight security analysis via statistical sampling. |
15 | SHARK: Architectural support for autonomic protection against stealth by rootkit exploits. |
9 | A distributed processor state management architecture for large-window processors. |
2 | Architectures and algorithms for millisecond-scale molecular dynamics simulations of proteins. |
1 | Microarchitecture in the system-level integration era. |
2007
Cited by | Paper title |
---|---|
449 | Optimizing NUCA Organizations and Wiring Alternatives for Large Caches with CACTI 6.0. |
420 | Stall-Time Fair Memory Access Scheduling for Chip Multiprocessors. |
365 | Flattened Butterfly Topology for On-Chip Networks. |
358 | Dynamic Warp Formation and Scheduling for Efficient GPU Control Flow. |
247 | Multi-bit Error Tolerant Caches Using Two-Dimensional Error Coding. |
241 | Argus: Low-Cost, Comprehensive Error Detection in Simple Cores. |
197 | A Practical Approach to Exploiting Coarse-Grained Pipeline Parallelism in C Programs. |
192 | Composable Lightweight Processors. |
184 | Penelope: The NBTI-Aware Processor. |
179 | Revisiting the Sequential Programming Model for Multi-Core. |
172 | Smart Refresh: An Enhanced Memory Controller Design for Reducing Energy in Conventional and 3D Die-Stacked DRAMs. |
157 | FPGA-Accelerated Simulation Technologies (FAST): Fast, Full-System, Cycle-Accurate Simulators. |
136 | Implementing Signatures for Transactional Memory. |
127 | A Framework for Providing Quality of Service in Chip Multi-Processors. |
112 | Mitigating Parameter Variation with Dynamic Fine-Grain Body Biasing. |
108 | Process Variation Tolerant 3T1D-Based Cache Architectures. |
107 | Software-Based Online Detection of Hardware Defects: Mechanisms, Architectural Support, and Evaluation. |
102 | Self-calibrating Online Wearout Detection. |
99 | Leveraging 3D Technology for Improved Reliability. |
92 | Using Address Independent Seed Encryption and Bonsai Merkle Trees to Make Secure Processors OS- and Performance-Friendly. |
75 | Microarchitectural Design Space Exploration Using an Architecture-Centric Approach. |
62 | A Framework for Coarse-Grain Optimizations in the On-Chip Memory Hierarchy. |
61 | Data Access Partitioning for Fine-grain Parallelism on Multicore Architectures. |
59 | Scavenger: A New Last Level Cache Architecture with Global Block Priority. |
52 | Emulating Optimal Replacement with a Shepherd Cache. |
38 | Uncorq: Unconstrained Snoop Request Delivery in Embedded-Ring Multiprocessors. |
36 | Impact of Cache Coherence Protocols on the Processing of Network Traffic. |
34 | Low-Cost Epoch-Based Correlation Prefetching for Commercial Applications. |
34 | Informed Microarchitecture Design Space Exploration Using Workload Dynamics. |
34 | Guaranteeing Hits to Improve the Efficiency of a Small Instruction Cache. |
28 | The Art of Deception: Adaptive Precision Reduction for Area Efficient Physics Acceleration. |
24 | Effective Optimistic-Checker Tandem Core Design through Architectural Pruning. |
20 | Global Multi-Threaded Instruction Scheduling. |
16 | Time Interpolation: So Many Metrics, So Few Registers. |
14 | Optimal versus Heuristic Global Code Scheduling. |
2006
Cited by | Paper title |
---|---|
941 | Utility-Based Cache Partitioning: A Low-Overhead, High-Performance, Runtime Mechanism to Partition Shared Caches. |
603 | An Analysis of Efficient Multi-Core Global Power Management Policies: Maximizing Performance for a Given Power Budget. |
552 | Die Stacking (3D) Microarchitecture. |
370 | LIFT: A Low-Overhead Practical Information Flow Tracking System for Detecting Security Attacks. |
358 | Managing Distributed, Shared L2 Caches through OS-Level Page Allocation. |
334 | Fair Queuing Memory Systems. |
302 | Leveraging Optical Technology in Future Bus-based Chip Multiprocessors. |
299 | ViChaR: A Dynamic Virtual Channel Regulator for Network-on-Chip Routers. |
282 | Live, Runtime Phase Monitoring and Prediction on Real Systems with Application to Dynamic Power Management. |
237 | ASR: Adaptive Selective Replication for CMP Caches. |
209 | Architectural Support for Software Transactional Memory. |
180 | Distributed Microarchitectural Protocols in the TRIPS Prototype Processor. |
159 | Reunion: Complexity-Effective Multicore Redundancy. |
137 | In-Network Cache Coherence. |
130 | A Predictive Performance Model for Superscalar Processors. |
128 | Mitigating the Impact of Process Variations on Processor Register Files and Execution Units. |
110 | Yield-Aware Cache Architectures. |
92 | Adaptive Caches: Effective Shaping of Cache Behavior to Workloads. |
90 | Memory Prefetching Using Adaptive Stream Detection. |
76 | Exploiting Fine-Grained Data Parallelism with Chip Multiprocessors and Fast Barriers. |
66 | NoSQ: Store-Load Communication without a Store Queue. |
60 | Molecular Caches: A caching structure for dynamic creation of application-specific Heterogeneous cache regions. |
59 | Coherence Ordering for Ring-based Chip Multiprocessors. |
58 | Fire-and-Forget: Load/Store Scheduling with No Store Queue at All. |
57 | Fairness and Throughput in Switch on Event Multithreading. |
50 | Scalable Cache Miss Handling for High Memory-Level Parallelism. |
48 | CAPSULE: Hardware-Assisted Parallel Execution of Component-Based Programs. |
47 | Dynamic Standby Prediction for Leakage Tolerant Microprocessor Functional Units. |
44 | Phoenix: Detecting and Recovering from Permanent Processor Design Bugs with Programmable Hardware. |
36 | Dataflow Predication. |
36 | Support for High-Frequency Streaming in CMPs. |
30 | PathExpander: Architectural Support for Increasing the Path Coverage of Dynamic Bug Detection. |
28 | Authentication Control Point and Its Implications For Secure Processor Design. |
27 | Diverge-Merge Processor (DMP): Dynamic Predicated Execution of Complex Control-Flow Graphs Based on Frequently Executed Paths. |
26 | Merging Head and Tail Duplication for Convergent Hyperblock Formation. |
20 | A Floorplan-Aware Dynamic Inductive Noise Controller for Reliable Processor Design. |
18 | DMDC: Delayed Memory Dependence Checking through Age-Based Filtering. |
18 | Virtually Pipelined Network Memory. |
16 | Serialization-Aware Mini-Graphs: Performance with Fewer Resources. |
11 | Using Branch Correlation to Identify Infeasible Paths for Anomaly Detection. |
11 | Memory Protection through Dynamic Access Control. |
5 | Data-Dependency Graph Transformations for Superblock Scheduling. |
2005
Cited by | Paper title |
---|---|
278 | Automatic Thread Extraction with Decoupled Software Pipelining. |
211 | A Dynamic Compilation Framework for Controlling Microprocessor Energy and Performance. |
156 | Stream Programming on General-Purpose Processors. |
115 | A Mechanism for Online Diagnosis of Hard Faults in Microprocessors. |
97 | Dynamic Helper Threaded Prefetching on the Sun UltraSPARC CMP Processor. |
91 | The TM3270 Media-Processor. |
77 | A Quantum Logic Array Microarchitecture: Scalable Quantum Data Movement and Computation. |
69 | Scalable Store-Load Forwarding via Store Queue Index Prediction. |
64 | The Cell Processor Architecture. |
61 | Shader Performance Analysis on a Modern GPU Architecture. |
57 | Pinot: Speculative Multi-threading Processor Architecture Exploiting Parallelism over a Wide Range of Granularities. |
55 | Thermal Management of On-Chip Caches Through Power Density Minimization. |
52 | Wish Branches: Combining Conditional Branching and Predication for Adaptive Predicated Execution. |
52 | Improving Region Selection in Dynamic Optimization Systems. |
49 | Address-Indexed Memory Disambiguation and Store-to-Load Forwarding. |
48 | ReSlice: Selective Re-Execution of Long-Retired Misspeculated Instructions Using Forward Slicing. |
45 | Continuous Path and Edge Profiling. |
42 | "Flea-flicker" Multipass Pipelining: An Alternative to the High-Power Out-of-Order Offense. |
40 | A Criticality Analysis of Clustering in Superscalar Processors. |
38 | Cherry-MP: Correctly Integrating Checkpointed Early Resource Recycling in Chip Multiprocessors. |
35 | Address-Value Delta (AVD) Prediction: Increasing the Effectiveness of Runahead Execution by Exploiting Regular Memory Allocation Patterns. |
34 | uComplexity: Estimating Processor Design Effort. |
34 | Store Memory-Level Parallelism Optimizations for Commercial Applications. |
30 | Cost Sensitive Modulo Scheduling in a Loop Accelerator Synthesis System. |
29 | How to Fake 1000 Registers. |
25 | Exploiting Vector Parallelism in Software Pipelined Loops. |
21 | Balancing Resource Utilization to Mitigate Power Density in Processor Pipelines. |
14 | Reducing Instruction Fetch Cost by Packing Instructions into Register Windows. |
10 | The Future Evolution of High-Performance Microprocessors. |
10 | Efficient Use of Invisible Registers in Thumb Code. |
6 | Incremental Commit Groups for Non-Atomic Trace Processing. |
0 | Message from the General Chairs. |
0 | Message from the Program Co-Chairs. |