HPCA
All
Cited by | Paper title | Year |
---|---|---|
1235 | Amdahl’s Law in the multicore era. | 2008 |
1022 | Evaluating MapReduce for Multi-core and Multiprocessor Systems. | 2007 |
770 | LogTM: log-based transactional memory. | 2006 |
617 | System level analysis of fast, per-core DVFS using on-chip switching regulators. | 2008 |
616 | Unbounded Transactional Memory. | 2005 |
589 | Predicting Inter-Thread Cache Contention on a Chip Multi-Processor Architecture. | 2005 |
499 | Power Efficient Processor Architecture and The Cell Processor. | 2005 |
395 | The Soft Error Problem: An Architectural Perspective. | 2005 |
386 | Graphite: A distributed parallel simulator for multicores. | 2010 |
373 | LogTM-SE: Decoupling Hardware Transactional Memory from Caches. | 2007 |
348 | A novel architecture of the 3D stacked MRAM L2 cache for CMPs. | 2009 |
336 | Gaining insights into multicore cache partitioning: Bridging the gap between simulation and real systems. | 2008 |
318 | ATLAS: A scalable and high-performance scheduling algorithm for multiple memory controllers. | 2010 |
315 | Regional congestion awareness for load balance in networks-on-chip. | 2008 |
266 | Dynamic power-performance adaptation of parallel computation on chip multiprocessors. | 2006 |
262 | Relaxing non-volatility for fast and energy-efficient STT-RAM caches. | 2011 |
243 | Feedback Directed Prefetching: Improving the Performance and Bandwidth-Efficiency of Hardware Prefetchers. | 2007 |
231 | CMP network-on-chip overlaid with multi-band RF-interconnect. | 2008 |
229 | Improving read performance of Phase Change Memories via Write Cancellation and Write Pausing. | 2010 |
229 | A quantitative performance analysis model for GPU architectures. | 2011 |
213 | An optimized 3D-stacked memory architecture by exploiting excessive, high-density TSV bandwidth. | 2010 |
211 | Cluster-level feedback power control for performance optimization. | 2008 |
206 | BigDataBench: A big data benchmark suite from internet services. | 2014 |
198 | Chip Multithreading: Opportunities and Challenges. | 2005 |
191 | High performance network virtualization with SR-IOV. | 2010 |
190 | Concurrent Direct Network Access for Virtual Machine Monitors. | 2007 |
183 | Performance, Energy, and Thermal Considerations for SMT and CMP Architectures. | 2005 |
183 | Design and evaluation of a hierarchical on-chip interconnect for next-generation CMPs. | 2009 |
177 | SafeMem: Exploiting ECC-Memory for Detecting Memory Leaks and Memory Corruption During Production Runs. | 2005 |
177 | Dynamically Specialized Datapaths for energy efficient computing. | 2011 |
176 | BulletProof: a defect-tolerant CMP switch architecture. | 2006 |
170 | Express Cube Topologies for on-Chip Interconnects. | 2009 |
169 | CMP design space exploration subject to physical constraints. | 2006 |
166 | Essential roles of exploiting internal parallelism of flash memory based solid state drives in high-speed data processing. | 2011 |
164 | Construction and use of linear regression models for processor performance analysis. | 2006 |
159 | Thermal Herding: Microarchitecture Techniques for Controlling Hotspots in High-Performance 3D-Integrated Processors. | 2007 |
154 | Thread block compaction for efficient SIMT control flow. | 2011 |
147 | A Scalable, Non-blocking Approach to Transactional Memory. | 2007 |
144 | Application-Level Correctness and its Impact on Fault Tolerance. | 2007 |
143 | Last level cache (LLC) performance of data mining workloads on a CMP - a case study of parallel bioinformatics workloads. | 2006 |
143 | FlexiShare: Channel sharing for an energy-efficient nanophotonic crossbar. | 2010 |
141 | An Adaptive Shared/Private NUCA Cache Partitioning Scheme for Chip Multiprocessors. | 2007 |
140 | I-CASH: Intelligently Coupled Array of SSD and HDD. | 2011 |
138 | FlexiTaint: A programmable accelerator for dynamic taint propagation. | 2008 |
137 | HARD: Hardware-Assisted Lockset-based Race Detection. | 2007 |
136 | A New Scalable and Cost-Effective Congestion Management Strategy for Lossless Multistage Interconnection Networks. | 2005 |
136 | FREE-p: Protecting non-volatile memory against both hard and soft errors. | 2011 |
132 | A comprehensive approach to DRAM power management. | 2008 |
131 | Operating system support for overlapping-ISA heterogeneous multi-core architectures. | 2010 |
131 | CHIPPER: A low-complexity bufferless deflection router. | 2011 |
130 | Technology comparison for large last-level caches (L3Cs): Low-leakage SRAM, low write-energy STT-RAM, and refresh-optimized eDRAM. | 2013 |
127 | Retention-aware placement in DRAM (RAPID): software methods for quasi-non-volatile DRAM. | 2006 |
125 | The common case transactional behavior of multithreaded programs. | 2006 |
125 | A Burst Scheduling Access Reordering Mechanism. | 2007 |
123 | Adaptive Spill-Receive for robust high-performance caching in CMPs. | 2009 |
120 | Application performance modeling in a virtualized environment. | 2010 |
119 | Variation-aware dynamic voltage/frequency scaling. | 2009 |
117 | Transition Phase Classification and Prediction. | 2005 |
117 | Computational sprinting. | 2012 |
115 | Scalable architectural support for trusted software. | 2010 |
113 | Extending Multicore Architectures to Exploit Hybrid Parallelism in Single-thread Applications. | 2007 |
112 | Phase characterization for power: evaluating control-flow-based and event-counter-based techniques. | 2006 |
112 | A Hybrid solid-state storage architecture for the performance, energy consumption, and lifetime improvement. | 2010 |
111 | Characterizing and Comparing Prevailing Simulation Techniques. | 2005 |
110 | Cuckoo directory: A scalable directory for many-core systems. | 2011 |
110 | Beyond block I/O: Rethinking traditional storage primitives. | 2011 |
109 | Improving write operations in MLC phase change memory. | 2012 |
107 | C-Oracle: Predictive thermal management for data centers. | 2008 |
107 | Power struggles: Revisiting the RISC vs. CISC debate on contemporary ARM and x86 architectures. | 2013 |
106 | Elastic-buffer flow control for on-chip networks. | 2009 |
105 | Perturbation-based Fault Screening. | 2007 |
104 | Improving Multiple-CMP Systems Using Token Coherence. | 2005 |
103 | Uncovering hidden loop level parallelism in sequential applications. | 2008 |
103 | Designing a processor from the ground up to allow voltage/reliability tradeoffs. | 2010 |
103 | Tiered-latency DRAM: A low latency and low cost DRAM architecture. | 2013 |
102 | MemTracker: Efficient and Programmable Support for Memory Access Monitoring and Debugging. | 2007 |
102 | Balancing DRAM locality and parallelism in shared memory CMP systems. | 2012 |
97 | Illustrative Design Space Studies with Microarchitectural Regression Models. | 2007 |
97 | Compute Caches. | 2017 |
95 | Eliminating microarchitectural dependency from Architectural Vulnerability. | 2009 |
95 | Dynamic hardware-assisted software-controlled page placement to manage capacity allocation and sharing within large caches. | 2009 |
94 | Prediction router: Yet another low latency on-chip router architecture. | 2009 |
92 | The case for GPGPU spatial multitasking. | 2012 |
91 | A Performance Comparison of DRAM Memory System Optimizations for SMT Processors. | 2005 |
91 | CORD: cost-effective (and nearly overhead-free) order-recording and data race detection. | 2006 |
91 | Interval simulation: Raising the level of abstraction in architectural simulation. | 2010 |
90 | Checkpointed Early Load Retirement. | 2005 |
90 | CHOP: Adaptive filter-based DRAM caching for CMP server platforms. | 2010 |
89 | PageNUCA: Selected policies for page-grain locality management in large shared chip-multiprocessor caches. | 2009 |
89 | Accurate microarchitecture-level fault modeling for studying hardware faults. | 2009 |
88 | Techniques for bandwidth-efficient prefetching of linked data structures in hybrid prefetching systems. | 2009 |
88 | SCD: A scalable coherence directory with flexible sharer set encoding. | 2012 |
88 | TAP: A TLP-aware cache management policy for a CPU-GPU heterogeneous architecture. | 2012 |
86 | DMA-aware memory energy management. | 2006 |
86 | Exploiting parallelism and structure to accelerate the simulation of chip multi-processors. | 2006 |
86 | ReViveI/O: efficient handling of I/O in highly-available rollback-recovery servers. | 2006 |
86 | A low-radix and low-diameter 3D interconnection network design. | 2009 |
86 | CAMP: A technique to estimate per-structure power at run-time using a few simple parameters. | 2009 |
84 | A Unified Compressed Memory Hierarchy. | 2005 |
84 | Performance and power optimization through data compression in Network-on-Chip architectures. | 2008 |
84 | Blueshift: Designing processors for timing speculation from the ground up. | 2009 |
83 | Towards scalable, energy-efficient, bus-based on-chip networks. | 2010 |
81 | Trends in High-Performance Processors. | 2005 |
81 | Understanding how off-chip memory bandwidth partitioning in Chip Multiprocessors affects system performance. | 2010 |
80 | MISE: Providing performance predictability and improving fairness in shared main memory systems. | 2013 |
80 | Cache coherence for GPU architectures. | 2013 |
79 | Fully-Buffered DIMM Memory Architectures: Understanding Mechanisms, Overheads and Scaling. | 2007 |
79 | CPU-assisted GPGPU on fused CPU-GPU architectures. | 2012 |
79 | High-performance and energy-efficient mobile web browsing on big/little systems. | 2013 |
78 | Voltage and Frequency Control With Adaptive Reaction Time in Multiple-Clock-Domain Processors. | 2005 |
77 | Understanding the performance-temperature interactions in disk I/O of server workloads. | 2006 |
77 | High performance file I/O for the Blue Gene/L supercomputer. | 2006 |
76 | A first-order fine-grained multithreaded throughput model. | 2009 |
76 | SolarCore: Solar energy driven multi-core architecture power management. | 2011 |
76 | ESESC: A fast multicore simulator using Time-Based Sampling. | 2013 |
75 | Distributing the Frontend for Temperature Reduction. | 2005 |
75 | HAsim: FPGA-based high-detail multicore simulation using time-division multiplexing. | 2011 |
74 | DeCoR: A Delayed Commit and Rollback mechanism for handling inductive noise in processors. | 2008 |
73 | Bridging the computation gap between programmable processors and hardwired accelerators. | 2009 |
73 | Calvin: Deterministic or not? Free will to choose. | 2011 |
73 | MRPB: Memory request prioritization for massively parallel processors. | 2014 |
72 | Shared last-level TLBs for chip multiprocessors. | 2011 |
72 | Runnemede: An architecture for Ubiquitous High-Performance Computing. | 2013 |
72 | Accelerating write by exploiting PCM asymmetries. | 2013 |
68 | CloudCache: Expanding and shrinking private caches. | 2011 |
68 | Improving DRAM performance by parallelizing refreshes with accesses. | 2014 |
67 | SENSS: Security Enhancement to Symmetric Shared Memory Multiprocessors. | 2005 |
66 | Voltage emergency prediction: Using signatures to reduce operating margins. | 2009 |
65 | A Memory-Level Parallelism Aware Fetch Policy for SMT Processors. | 2007 |
65 | An OS-based alternative to full hardware coherence on tiled CMPs. | 2008 |
65 | Optimizing communication and capacity in a 3D stacked reconfigurable cache hierarchy. | 2009 |
65 | In-Network Snoop Ordering (INSO): Snoopy coherence on unordered interconnects. | 2009 |
65 | Warped register file: A power efficient register file for GPGPUs. | 2013 |
64 | On the Limits of Leakage Power Reduction in Caches. | 2005 |
64 | Interactions Between Compression and Prefetching in Chip Multiprocessors. | 2007 |
64 | Line Distillation: Increasing Cache Capacity by Filtering Unused Words in Cache Lines. | 2007 |
64 | Automated microprocessor stressmark generation. | 2008 |
63 | An Adaptive Cache Coherence Protocol Optimized for Producer-Consumer Sharing. | 2007 |
63 | Hardware-software integrated approaches to defend against software cache-based side channel attacks. | 2009 |
62 | Addressing system-level trimming issues in on-chip nanophotonic networks. | 2011 |
62 | A case for guarded power gating for multi-core processors. | 2011 |
61 | Stretching the Limits of Clock-Gating Efficiency in Server-Class Processors. | 2005 |
61 | A Small, Fast and Low-Power Register File by Bit-Partitioning. | 2005 |
61 | Mercury: A fast and energy-efficient multi-level cell based Phase Change Memory system. | 2011 |
60 | Practical and secure PCM systems by online detection of malicious write streams. | 2011 |
60 | Whole packet forwarding: Efficient design of fully adaptive routing algorithms for networks-on-chip. | 2012 |
59 | Effective Instruction Prefetching in Chip Multiprocessors for Modern Commercial Applications. | 2005 |
59 | Worth their watts? - an empirical study of datacenter servers. | 2010 |
59 | Reducing GPU offload latency via fine-grained CPU-GPU synchronization. | 2013 |
58 | Efficient scrub mechanisms for error-prone emerging memories. | 2012 |
57 | Modeling and Managing Thermal Profiles of Rack-mounted Servers with ThermoStat. | 2007 |
56 | MRR: Enabling fully adaptive multicast routing for CMP interconnection networks. | 2009 |
56 | Versatile prediction and fast estimation of Architectural Vulnerability Factor from processor performance metrics. | 2009 |
56 | Programming the cloud. | 2011 |
56 | Improving GPGPU resource utilization through alternative thread block scheduling. | 2014 |
55 | Application-to-core mapping policies to reduce memory system interference in multi-core systems. | 2013 |
54 | Enterprise IT Trends and Implications for Architecture Research. | 2005 |
54 | Archipelago: A polymorphic cache design for enabling robust near-threshold operation. | 2011 |
54 | Booster: Reactive core acceleration for mitigating the effects of process variation and application imbalance in low-voltage chips. | 2012 |
53 | Microarchitectural Wire Management for Performance and Power in Partitioned Architectures. | 2005 |
53 | An Efficient Programmable 10 Gigabit Ethernet Network Interface Card. | 2005 |
53 | Navigating heterogeneous processors with market mechanisms. | 2013 |
52 | iCFP: Tolerating all-level cache misses in in-order processors. | 2009 |
52 | Overcoming the challenges of crossbar resistive memory architectures. | 2015 |
51 | Architecture support for guest-transparent VM protection from untrusted hypervisor and physical attacks. | 2013 |
50 | A case for Refresh Pausing in DRAM memory systems. | 2013 |
50 | Breaking the on-chip latency barrier using SMART. | 2013 |
50 | Adaptive-latency DRAM: Optimizing DRAM timing for the common-case. | 2015 |
49 | InfoShield: a security architecture for protecting information usage in memory. | 2006 |
49 | Design and implementation of the Blue Gene/P snoop filter. | 2008 |
49 | Design and implementation of software-managed caches for multicores with local memory. | 2009 |
49 | Energy-efficient interconnect via Router Parking. | 2013 |
49 | Power-performance co-optimization of throughput core architecture using resistive memory. | 2013 |
49 | Architecture exploration for ambient energy harvesting nonvolatile processors. | 2015 |
48 | NUcache: An efficient multicore cache organization based on Next-Use distance. | 2011 |
48 | QuickIA: Exploring heterogeneous architectures on real prototypes. | 2012 |
48 | SNNAP: Approximate computing on programmable SoCs via neural acceleration. | 2015 |
47 | Scatter-Add in Data Parallel Architectures. | 2005 |
47 | Thread-safe dynamic binary translation using transactional memory. | 2008 |
47 | Dacota: Post-silicon validation of the memory subsystem in multi-core designs. | 2009 |
47 | Cooperative partitioning: Energy-efficient cache partitioning for high-performance CMPs. | 2012 |
47 | Optimizing virtual machine scheduling in NUMA multicore systems. | 2013 |
47 | i2WAP: Improving non-volatile cache lifetime by reducing inter- and intra-set write variations. | 2013 |
46 | Adaptive placement and migration policy for an STT-RAM-based hybrid cache. | 2014 |
45 | An approach for implementing efficient superscalar CISC processors. | 2006 |
45 | A new server I/O architecture for high speed networks. | 2011 |
44 | Atomic Coherence: Leveraging nanophotonics to build race-free cache coherence protocols. | 2011 |
44 | Fast thread migration via cache working set prediction. | 2011 |
44 | Dynamically heterogeneous cores through 3D resource pooling. | 2012 |
44 | Enabling distributed generation powered sustainable high-performance data center. | 2013 |
44 | A detailed GPU cache model based on reuse distance theory. | 2014 |
43 | Heat Stroke: Power-Density-Based Denial of Service in SMT. | 2005 |
43 | Coset coding to extend the lifetime of memory. | 2013 |
43 | NDA: Near-DRAM acceleration architecture leveraging commodity DRAM devices and standard memory modules. | 2015 |
42 | EXCES: External caching in energy saving storage systems. | 2008 |
42 | Simple virtual channel allocation for high throughput and high frequency on-chip routers. | 2010 |
42 | Design, integration and implementation of the DySER hardware accelerator into OpenSPARC. | 2012 |
41 | Colorama: Architectural Support for Data-Centric Synchronization. | 2007 |
41 | MorphCache: A Reconfigurable Adaptive Multi-level Cache hierarchy. | 2011 |
41 | Quasi-nonvolatile SSD: Trading flash memory nonvolatility to improve storage system performance for enterprise applications. | 2012 |
41 | Staged Reads: Mitigating the impact of DRAM writes on DRAM reads. | 2012 |
41 | EnergySmart: Toward energy-efficient manycores for Near-Threshold Computing. | 2013 |
41 | DASCA: Dead Write Prediction Assisted STT-RAM Cache Architecture. | 2014 |
41 | Suppressing the Oblivious RAM timing channel while making information leakage and program efficiency trade-offs. | 2014 |
41 | Data retention in MLC NAND flash memory: Characterization, optimization, and recovery. | 2015 |
40 | Accelerating and Adapting Precomputation Threads for Efficient Prefetching. | 2007 |
40 | Improving cache performance using read-write partitioning. | 2014 |
40 | MemZip: Exploring unconventional benefits from memory compression. | 2014 |
40 | Heterogeneous memory architectures: A HW/SW approach for mixing die-stacked and off-package memories. | 2015 |
39 | Liquid SIMD: Abstracting SIMD Hardware using Lightweight Dynamic Mapping. | 2007 |
39 | ESP-NUCA: A low-cost adaptive Non-Uniform Cache Architecture. | 2010 |
39 | Dynamic parallelization of JavaScript applications using an ultra-lightweight speculation mechanism. | 2011 |
39 | AgileRegulator: A hybrid voltage regulator scheme redeeming dark silicon for power efficiency in a multicore architecture. | 2012 |
39 | The dual-path execution model for efficient GPU control flow. | 2013 |
38 | A Domain-Specific On-Chip Network Design for Large Scale Cache Systems. | 2007 |
38 | Runtime validation of memory ordering using constraint graph checking. | 2008 |
37 | Efficient complex operators for irregular codes. | 2011 |
37 | System-level implications of disaggregated memory. | 2012 |
37 | Timing channel protection for a shared memory controller. | 2014 |
37 | Supporting x86-64 address translation for 100s of GPU lanes. | 2014 |
36 | Error Detection via Online Checking of Cache Coherence with Token Coherence Signatures. | 2007 |
36 | An intelligent IT infrastructure for the future. | 2009 |
36 | Optimizing Google’s warehouse scale computers: The NUMA experience. | 2013 |
36 | Sonic Millip3De: A massively parallel 3D-stacked accelerator for 3D ultrasound. | 2013 |
35 | A Low Overhead Fault Tolerant Coherence Protocol for CMP Architectures. | 2007 |
35 | GPGPU performance and power estimation using machine learning. | 2015 |
34 | Exploring the Design Space of Power-Aware Opto-Electronic Networked Systems. | 2005 |
34 | Supporting highly-decoupled thread-level redundancy for parallel programs. | 2008 |
34 | Reconciling specialization and flexibility through compound circuits. | 2009 |
34 | Fast complete memory consistency verification. | 2009 |
34 | Abstraction and microarchitecture scaling in early-stage power modeling. | 2011 |
34 | Disintegrated control for energy-efficient and heterogeneous memory systems. | 2013 |
34 | QuickRelease: A throughput-oriented approach to release consistency on GPUs. | 2014 |
34 | Event-based scheduling for energy-efficient QoS (eQoS) in mobile Web applications. | 2015 |
33 | A decoupled KILO-instruction processor. | 2006 |
33 | Characterization of Direct Cache Access on multi-core systems and 10GbE. | 2009 |
33 | Power-efficient computing for compute-intensive GPGPU applications. | 2013 |
33 | Increasing TLB reach by exploiting clustering in page translations. | 2014 |
32 | A bandwidth-aware memory-subsystem resource management using non-invasive resource profilers for large CMP systems. | 2010 |
32 | Bloom Filter Guided Transaction Scheduling. | 2011 |
32 | Refrint: Intelligent refresh to minimize power in on-chip multiprocessor cache hierarchies. | 2013 |
31 | Reducing resource redundancy for concurrent error detection techniques in high performance microprocessors. | 2006 |
31 | Fundamental performance constraints in horizontal fusion of in-order cores. | 2008 |
30 | UNified Instruction/Translation/Data (UNITD) coherence: One protocol to rule them all. | 2010 |
30 | Hardware/software techniques for DRAM thermal management. | 2011 |
30 | Achieving uniform performance and maximizing throughput in the presence of heterogeneity. | 2011 |
30 | Efficient data streaming with on-chip accelerators: Opportunities and challenges. | 2011 |
30 | Adrenaline: Pinpointing and reining in tail queries with quick voltage boosting. | 2015 |
29 | DMA cache: Using on-chip storage to architecturally separate I/O data from CPU data for improving I/O performance. | 2010 |
29 | SCRAP: Architecture for signature-based protection from Code Reuse Attacks. | 2013 |
29 | Mosaic: Exploiting the spatial locality of process variation to reduce refresh energy in on-chip eDRAM modules. | 2014 |
29 | Improving system throughput and fairness simultaneously in shared memory CMP systems via Dynamic Bank Partitioning. | 2014 |
29 | Sandbox Prefetching: Safe run-time evaluation of aggressive prefetchers. | 2014 |
29 | Warp-level divergence in GPUs: Characterization, impact, and mitigation. | 2014 |
28 | Multithreaded Value Prediction. | 2005 |
28 | Increasing the cache efficiency by eliminating noise. | 2006 |
28 | Practical off-chip meta-data for temporal memory streaming. | 2009 |
28 | Explaining cache SER anomaly using DUE AVF measurement. | 2010 |
28 | MORSE: Multi-objective reconfigurable self-optimizing memory scheduler. | 2012 |
28 | Modeling performance variation due to cache sharing. | 2013 |
28 | Scaling towards kilo-core processors with asymmetric high-radix topologies. | 2013 |
28 | Dynamic management of TurboMode in modern multi-core chips. | 2014 |
28 | TSO-CC: Consistency directed cache coherence for TSO. | 2014 |
27 | Software Directed Issue Queue Power Reduction. | 2005 |
27 | Efficient instruction schedulers for SMT processors. | 2006 |
27 | Single-level integrity and confidentiality protection for distributed shared memory multiprocessors. | 2008 |
27 | ACCESS: Smart scheduling for asymmetric cache CMPs. | 2011 |
27 | Low-voltage on-chip cache architecture using heterogeneous cell sizes for high-performance processors. | 2011 |
26 | Low-Overhead Interactive Debugging via Dynamic Instrumentation with DISE. | 2005 |
26 | Accurate Energy Dissipation and Thermal Modeling for Nanometer-Scale Buses. | 2005 |
26 | High-throughput pairwise point interactions in Anton, a specialized machine for molecular dynamics simulation. | 2008 |
26 | Power-Efficient DRAM Speculation. | 2008 |
26 | Power shifting in Thrifty Interconnection Network. | 2011 |
26 | JETC: Joint energy thermal and cooling management for memory and CPU subsystems in servers. | 2012 |
26 | Statistical performance comparisons of computers. | 2012 |
26 | Exploiting thermal energy storage to reduce data center capital and operating expenses. | 2014 |
26 | Understanding GPU errors on large-scale HPC systems and the implications for system design and operation. | 2015 |
26 | Exploiting compressed block size as an indicator of future reuse. | 2015 |
26 | Coordinated static and dynamic cache bypassing for GPUs. | 2015 |
26 | Bamboo ECC: Strong, safe, and flexible codes for reliable computer memory. | 2015 |
25 | Optical Interconnect Opportunities for Future Server Memory Systems. | 2007 |
25 | Data-triggered threads: Eliminating redundant computation. | 2011 |
25 | MP3: Minimizing performance penalty for power-gating of Clos network-on-chip. | 2014 |
25 | Mascar: Speeding up GPU warps by reducing memory pitstops. | 2015 |
25 | CATalyst: Defeating last-level cache side channel attacks in cloud computing. | 2016 |
24 | Exploiting Postdominance for Speculative Parallelization. | 2007 |
24 | Address-branch correlation: A novel locality for long-latency hard-to-predict branches. | 2008 |
24 | BOLT: Energy-efficient Out-of-Order Latency-Tolerant execution. | 2010 |
24 | π-TM: Pessimistic invalidation for scalable lazy hardware transactional memory. | 2012 |
24 | Layout-conscious random topologies for HPC off-chip interconnects. | 2013 |
24 | ECM: Effective Capacity Maximizer for high-performance compressed caching. | 2013 |
24 | NUAT: A non-uniform access time memory controller. | 2014 |
24 | Quantifying sources of error in McPAT and potential impacts on architectural studies. | 2015 |
24 | Power punch: Towards non-blocking power-gating of NoC routers. | 2015 |
23 | Completely verifying memory consistency of test program executions. | 2006 |
23 | Incorporating flexibility in Anton, a specialized machine for molecular dynamics simulation. | 2008 |
23 | Architectural Contesting. | 2009 |
23 | QORE: A fault tolerant network-on-chip architecture with power-efficient quad-function channel (QFC) buffers. | 2014 |
23 | ChargeCache: Reducing DRAM latency by exploiting row access locality. | 2016 |
22 | Value Based BTB Indexing for indirect jump prediction. | 2010 |
22 | Storage free confidence estimation for the TAGE branch predictor. | 2011 |
22 | Power balanced pipelines. | 2012 |
22 | Network congestion avoidance through Speculative Reservation. | 2012 |
22 | Mobile CPU’s rise to power: Quantifying the impact of generational mobile CPU design trends on performance, energy, and user satisfaction. | 2016 |
21 | Software-hardware cooperative memory disambiguation. | 2006 |
21 | Improving Branch Prediction and Predicated Execution in Out-of-Order Processors. | 2007 |
21 | Runahead Threads to improve SMT performance. | 2008 |
21 | Decoupled dynamic cache segmentation. | 2012 |
21 | Octopus-Man: QoS-driven task management for heterogeneous multicores in warehouse-scale computers. | 2015 |
20 | Store vectors for scalable memory dependence prediction and scheduling. | 2006 |
20 | Roughness of microarchitectural design topologies and its implications for optimization. | 2008 |
20 | Offline symbolic analysis to infer Total Store Order. | 2011 |
20 | MACAU: A Markov model for reliability evaluations of caches under Single-bit and Multi-bit Upsets. | 2012 |
20 | Cost effective data center servers. | 2013 |
20 | XChange: A market-based approach to scalable dynamic multi-resource allocation in multicore architectures. | 2015 |
19 | Feedback mechanisms for improving probabilistic memory prefetching. | 2009 |
19 | Soft error vulnerability aware process variation mitigation. | 2009 |
19 | IADVS: On-demand performance for interactive applications. | 2010 |
19 | HAQu: Hardware-accelerated queueing for fine-grained threading on a chip multiprocessor. | 2011 |
19 | Reducing the cost of persistence for nonvolatile heaps in end user devices. | 2014 |
19 | Concurrent and consistent virtual machine introspection with hardware transactional memory. | 2014 |
19 | CREAM: A Concurrent-Refresh-Aware DRAM Memory architecture. | 2014 |
19 | Stash directory: A scalable directory for many-core coherence. | 2014 |
19 | Priority-based cache allocation in throughput processors. | 2015 |
19 | Prediction-based superpage-friendly TLB designs. | 2015 |
19 | Unlocking bandwidth for GPUs in CC-NUMA systems. | 2015 |
19 | Low-Cost Inter-Linked Subarrays (LISA): Enabling fast inter-subarray data movement in DRAM. | 2016 |
18 | Prediction of CPU idle-busy activity pattern. | 2008 |
18 | Checked Load: Architectural support for JavaScript type-checking on mobile processors. | 2011 |
18 | WEST: Cloning data cache behavior using Stochastic Traces. | 2012 |
18 | Supporting efficient collective communication in NoCs. | 2012 |
18 | Pacman: Tolerating asymmetric data races with unintrusive hardware. | 2012 |
18 | Improving multi-core performance using mixed-cell cache architecture. | 2013 |
18 | Worm-Bubble Flow Control. | 2013 |
18 | Sprinkler: Maximizing resource utilization in many-chip solid state disks. | 2014 |
18 | PVCoherence: Designing flat coherence protocols for scalable verification. | 2014 |
18 | Supporting superpages in non-contiguous physical memory. | 2015 |
18 | BRAINIAC: Bringing reliable accuracy into neurally-implemented approximate computing. | 2015 |
17 | Probabilistic counter updates for predictor hysteresis and stratification. | 2006 |
17 | LiteTM: Reducing transactional state overhead. | 2010 |
17 | Locality-aware data replication in the Last-Level Cache. | 2014 |
17 | Spare register aware prefetching for graph algorithms on GPUs. | 2014 |
17 | Implications of high energy proportional servers on cluster-wide energy proportionality. | 2014 |
17 | Practical data value speculation for future high-end processors. | 2014 |
17 | Talus: A simple way to remove cliffs in cache performance. | 2015 |
17 | Hierarchical private/shared classification: The key to simple and efficient coherence for clustered cache hierarchies. | 2015 |
16 | Criticality-based optimizations for efficient load processing. | 2009 |
16 | SIF: Overcoming the limitations of SIMD devices via implicit permutation. | 2010 |
16 | StimulusCache: Boosting performance of chip multiprocessors with excess cache. | 2010 |
16 | Delay-Hiding energy management mechanisms for DRAM. | 2010 |
16 | Network within a network approach to create a scalable high-radix router microarchitecture. | 2012 |
16 | Tag tables. | 2015 |
15 | Parabix: Boosting the efficiency of text processing on commodity processors. | 2012 |
15 | Cache restoration for highly partitioned virtualized systems. | 2012 |
15 | Exploring high-performance and energy proportional interface for phase change memory systems. | 2013 |
15 | Tangle: Route-oriented dynamic voltage minimization for variation-afflicted, energy-efficient on-chip networks. | 2014 |
15 | A scalable multi-path microarchitecture for efficient GPU control flow. | 2014 |
15 | CAFO: Cost aware flip optimization for asymmetric memories. | 2015 |
15 | Malware-aware processors: A framework for efficient online malware detection. | 2015 |
14 | Implications of Device Timing Variability on Full Chip Timing. | 2007 |
14 | PEEP: Exploiting predictability of memory dependences in SMT processors. | 2008 |
14 | Adaptive Reliability Chipkill Correct (ARCC). | 2013 |
14 | Precision-aware soft error protection for GPUs. | 2014 |
14 | Revolver: Processor architecture for power efficient loop execution. | 2014 |
14 | Understanding contention-based channels and using them for defense. | 2015 |
13 | Exascale computing: The challenges and opportunities in the next decade. | 2010 |
13 | Accelerating business analytics applications. | 2012 |
13 | Undersubscribed threading on clustered cache architectures. | 2014 |
13 | Domain knowledge based energy management in handhelds. | 2015 |
13 | Paying to save: Reducing cost of colocation data center via rewards. | 2015 |
13 | Memristive Boltzmann machine: A hardware accelerator for combinatorial optimization and deep learning. | 2016 |
12 | Chip-multiprocessing and beyond. | 2006 |
12 | PaCo: Probability-based path confidence prediction. | 2008 |
12 | Adaptive Set-Granular Cooperative Caching. | 2012 |
12 | TS-Router: On maximizing the Quality-of-Allocation in the On-Chip Network. | 2013 |
12 | Dynamically detecting and tolerating IF-Condition Data Races. | 2014 |
12 | DraMon: Predicting memory bandwidth usage of multi-threaded programs with high accuracy and low overhead. | 2014 |
12 | Up by their bootstraps: Online learning in Artificial Neural Networks for CMP uncore power management. | 2014 |
12 | Scaling distributed cache hierarchies through computation and data co-scheduling. | 2015 |
11 | Tapping ZettaRAM™ for Low-Power Memory Systems. | 2005 |
11 | Using Virtual Load/Store Queues (VLSQs) to Reduce the Negative Effects of Reordered Memory Instructions. | 2005 |
11 | Skinflint DRAM system: Minimizing DRAM chip writes for low power. | 2013 |
11 | Macho: A failure model-oriented adaptive cache architecture to enable near-threshold voltage scaling. | 2013 |
11 | Accordion: Toward soft Near-Threshold Voltage Computing. | 2014 |
11 | 3D stacking of high-performance processors. | 2014 |
11 | Augmenting low-latency HPC network with free-space optical links. | 2015 |
11 | TABLA: A unified template-based framework for accelerating statistical machine learning. | 2016 |
10 | Serializing instructions in system-intensive workloads: Amdahl’s Law strikes again. | 2008 |
10 | Speculative instruction validation for performance-reliability trade-off. | 2008 |
10 | COMIC++: A software SVM system for heterogeneous multicore accelerator clusters. | 2010 |
10 | BulkSMT: Designing SMT processors for atomic-block execution. | 2012 |
10 | Illusionist: Transforming lightweight cores into aggressive cores on demand. | 2013 |
10 | Store-Load-Branch (SLB) predictor: A compiler assisted branch prediction for data dependent branches. | 2013 |
10 | STM: Cloning the spatial and temporal memory access behavior. | 2014 |
10 | Strategies for anticipating risk in heterogeneous system design. | 2014 |
10 | Overcoming far-end congestion in large-scale networks. | 2015 |
10 | Revisiting virtual L1 caches: A practical design using dynamic synonym remapping. | 2016 |
10 | Energy-efficient address translation. | 2016 |
9 | Exploiting criticality to reduce bottlenecks in distributed uniprocessors. | 2011 |
9 | Rainbow: Efficient memory dependence recording with high replay parallelism for relaxed memory model. | 2013 |
9 | RECAP: A region-based cure for the common cold (cache). | 2013 |
9 | SCOC: High-radix switches made of bufferless clos networks. | 2015 |
9 | FTXen: Making hypervisor resilient to hardware faults on relaxed cores. | 2015 |
9 | Simultaneous Multikernel GPU: Multi-tasking throughput processors via fine-grained sharing. | 2016 |
9 | A performance analysis framework for optimizing OpenCL applications on FPGAs. | 2016 |
9 | HRL: Efficient and flexible reconfigurable logic for near-data processing. | 2016 |
8 | Performance-aware speculation control using wrong path usefulness prediction. | 2008 |
8 | Handling branches in TLS systems with Multi-Path Execution. | 2010 |
8 | Hardware/software-based diagnosis of load-store queues using expandable activity logs. | 2011 |
8 | Bridging the semantic gap: Emulating biological neuronal behaviors with simple digital neurons. | 2013 |
8 | A Non-Inclusive Memory Permissions architecture for protection against cross-layer attacks. | 2014 |
8 | Reducing read latency of phase change memory via early read and Turbo Read. | 2015 |
8 | Warped-preexecution: A GPU pre-execution approach for improving latency hiding. | 2016 |
8 | A case for toggle-aware compression for GPU systems. | 2016 |
7 | Architectural support for synchronization-free deterministic parallel programming. | 2012 |
7 | A novel system architecture for web scale applications using lightweight CPUs and virtualized I/O. | 2013 |
7 | A multiple SIMD, multiple data (MSMD) architecture: Parallel execution of dynamic and static SIMD fragments. | 2013 |
7 | Two level bulk preload branch prediction. | 2013 |
7 | High-speed formal verification of heterogeneous coherence hierarchies. | 2013 |
7 | Understanding the impact of gate-level physical reliability effects on whole program execution. | 2014 |
7 | Atomic SC for simple in-order processors. | 2014 |
7 | Transportation-network-inspired network-on-chip. | 2014 |
7 | FADE: A programmable filtering accelerator for instruction-grain monitoring. | 2014 |
7 | Exploring architectural heterogeneity in intelligent vision systems. | 2015 |
7 | GPU voltage noise: Characterization and hierarchical smoothing of spatial and temporal voltage noise interference in GPU architectures. | 2015 |
7 | BeBoP: A cost effective predictor infrastructure for superscalar value prediction. | 2015 |
7 | Understanding idle behavior and power gating mechanisms in the context of modern benchmarks on CPU-GPU Integrated systems. | 2015 |
7 | Cache QoS: From concept to reality in the Intel® Xeon® processor E5-2600 v3 product family. | 2016 |
7 | A large-scale study of soft-errors on GPUs in the field. | 2016 |
7 | Atomic persistence for SCM with a non-intrusive backend controller. | 2016 |
6 | High-Performance low-vcc in-order core. | 2010 |
6 | Flexible register management using reference counting. | 2012 |
6 | In-network traffic regulation for Transactional Memory. | 2013 |
6 | iPatch: Intelligent fault patching to improve energy efficiency. | 2015 |
6 | Flask coherence: A morphable hybrid coherence protocol to balance energy, performance and scalability. | 2015 |
6 | Balancing reliability, cost, and performance tradeoffs with FreeFault. | 2015 |
6 | Selective GPU caches to eliminate CPU-GPU HW cache coherence. | 2016 |
5 | Speculative synchronization and thread management for fine granularity threads. | 2006 |
5 | Fabric convergence implications on systems architecture. | 2008 |
5 | HARE: Hardware assisted reverse execution. | 2010 |
5 | DMA++: on the fly data realignment for on-chip memories. | 2010 |
5 | Fg-STP: Fine-Grain Single Thread Partitioning on Multicores. | 2011 |
5 | Architectural framework for supporting operating system survivability. | 2011 |
5 | A group-commit mechanism for ROB-based processors implementing the X86 ISA. | 2013 |
5 | Over-clocked SSD: Safely running beyond flash memory chip I/O clock specs. | 2014 |
5 | CDTT: Compiler-generated data-triggered threads. | 2014 |
5 | Scalably verifiable dynamic power management. | 2014 |
5 | GPUdmm: A high-performance and memory-oblivious GPU architecture using dynamic memory management. | 2014 |
5 | High performing cache hierarchies for server workloads: Relaxing inclusion to capture the latency benefits of exclusive caches. | 2015 |
5 | Increasing multicore system efficiency through intelligent bandwidth shifting. | 2015 |
5 | Understanding the virtualization “Tax” of scale-out pass-through GPUs in GaaS clouds: An empirical study. | 2015 |
5 | CiDRA: A cache-inspired DRAM resilience architecture. | 2015 |
5 | Scalable communication architecture for network-attached accelerators. | 2015 |
5 | VSR sort: A novel vectorised sorting algorithm & architecture extensions for future microprocessors. | 2015 |
5 | Efficient footprint caching for Tagless DRAM Caches. | 2016 |
5 | A complete key recovery timing attack on a GPU. | 2016 |
5 | McVerSi: A test generation framework for fast memory consistency verification in simulation. | 2016 |
5 | Pushing the limits of accelerator efficiency while retaining programmability. | 2016 |
5 | Lattice priority scheduling: Low-overhead timing-channel protection for a shared memory controller. | 2016 |
5 | Restore truncation for performance improvement in future DRAM systems. | 2016 |
5 | Modeling cache performance beyond LRU. | 2016 |
5 | SLaC: Stage laser control for a flattened butterfly network. | 2016 |
4 | Interconnect-Centric Computing. | 2007 |
4 | Branch-mispredict level parallelism (BLP) for control independence. | 2008 |
4 | LeadOut: Composing low-overhead frequency-enhancing techniques for single-thread performance in configurable multicores. | 2010 |
4 | BulkCompactor: Optimized deterministic execution via Conflict-Aware commit of atomic blocks. | 2012 |
4 | Architectural perspectives of future wireless base stations based on the IBM PowerEN™ processor. | 2012 |
4 | How to implement effective prediction and forwarding for fusable dynamic multicore architectures. | 2013 |
4 | Correction prediction: Reducing error correction latency for on-chip memories. | 2015 |
4 | CompEx: Compression-expansion coding for energy, latency, and lifetime improvements in MLC/TLC NVM. | 2016 |
4 | ScalCore: Designing a core for voltage scalability. | 2016 |
4 | Best-offset hardware prefetching. | 2016 |
4 | Predicting the memory bandwidth and optimal core allocations for multi-threaded applications on large-scale NUMA machines. | 2016 |
4 | Towards high performance paged memory for GPUs. | 2016 |
4 | SoftMC: A Flexible and Practical Open-Source Infrastructure for Enabling Experimental DRAM Studies. | 2017 |
4 | Vulnerabilities in MLC NAND Flash Memory Programming: Experimental Analysis, Exploits, and Mitigation Techniques. | 2017 |
3 | Petascale Computing Research Challenges - A Manycore Perspective. | 2007 |
3 | Lightweight predication support for out of order processors. | 2009 |
3 | MOPED: Orchestrating interprocess message data on CMPs. | 2011 |
3 | Safe and efficient supervised memory systems. | 2011 |
3 | Improving smartphone user experience by balancing performance and energy with probabilistic QoS guarantee. | 2016 |
3 | LASER: Light, Accurate Sharing dEtection and Repair. | 2016 |
3 | A low power software-defined-radio baseband processor for the Internet of Things. | 2016 |
3 | Parity Helix: Efficient protection for single-dimensional faults in multi-dimensional memory systems. | 2016 |
3 | Symbiotic job scheduling on the IBM POWER8. | 2016 |
3 | MaPU: A novel mathematical computing architecture. | 2016 |
3 | Transparent and Efficient CFI Enforcement with Intel Processor Trace. | 2017 |
2 | Industrial Perspectives: Platform Design Challenges with Many cores. | 2006 |
2 | Opportunities beyond single-core microprocessors. | 2009 |
2 | Accelerating decoupled look-ahead via weak dependence removal: A metaheuristic approach. | 2014 |
2 | Studying the impact of multicore processor scaling on directory techniques via reuse distance analysis. | 2015 |
2 | Alloy: Parallel-serial memory channel architecture for single-chip heterogeneous processor systems. | 2015 |
2 | Approximating warps with intra-warp operand value similarity. | 2016 |
2 | Software transparent dynamic binary translation for coarse-grain reconfigurable architectures. | 2016 |
2 | Amdahl’s law for lifetime reliability scaling in heterogeneous multicore processors. | 2016 |
2 | Cost effective physical register sharing. | 2016 |
2 | A low-power hybrid reconfigurable architecture for resistive random-access memories. | 2016 |
2 | LiveSim: Going live with microarchitecture simulation. | 2016 |
2 | Core tunneling: Variation-aware voltage noise mitigation in GPUs. | 2016 |
2 | Venice: Exploring server architectures for effective resource sharing. | 2016 |
2 | PipeLayer: A Pipelined ReRAM-Based Accelerator for Deep Learning. | 2017 |
1 | Architecting for power management: The IBM POWER7™ approach. | 2010 |
1 | Hybrid latency tolerance for robust energy-efficiency on 1000-core data parallel processors. | 2013 |
1 | Low-overhead and high coverage run-time race detection through selective meta-data management. | 2014 |
1 | DVFS for NoCs in CMPs: A thread voting approach. | 2016 |
1 | DUANG: Fast and lightweight page migration in asymmetric memory systems. | 2016 |
1 | PleaseTM: Enabling transaction conflict management in requester-wins hardware transactional memory. | 2016 |
1 | Minimal disturbance placement and promotion. | 2016 |
1 | iPAWS: Instruction-issue pattern-based adaptive warp scheduling for GPGPUs. | 2016 |
1 | Efficient synthetic traffic models for large, complex SoCs. | 2016 |
1 | Efficient GPU hardware transactional memory through early conflict resolution. | 2016 |
1 | The runahead network-on-chip. | 2016 |
1 | RADAR: Runtime-assisted dead region management for last-level caches. | 2016 |
1 | SizeCap: Efficiently handling power surges in fuel cell powered data centers. | 2016 |
1 | A market approach for handling power emergencies in multi-tenant data center. | 2016 |
1 | Cooper: Task Colocation with Cooperative Games. | 2017 |
1 | Secure Dynamic Memory Scheduling Against Timing Channel Attacks. | 2017 |
1 | Controlled Kernel Launch for Dynamic Parallelism in GPUs. | 2017 |
1 | Exploring Hyperdimensional Associative Memory. | 2017 |
1 | SILC-FM: Subblocked InterLeaved Cache-Like Flat Memory Organization. | 2017 |
1 | ATOM: Atomic Durability in Non-volatile Memory through Hardware Logging. | 2017 |
1 | MemPod: A Clustered Architecture for Efficient and Scalable Migration in Flat Address Space Multi-level Memories. | 2017 |
1 | Needle: Leveraging Program Analysis to Analyze and Extract Accelerators from Whole Programs. | 2017 |
1 | Dynamic GPGPU Power Management Using Adaptive Model Predictive Control. | 2017 |
1 | SWAP: Effective Fine-Grain Management of Shared Last-Level Caches with Minimum Hardware Support. | 2017 |
1 | GraphPIM: Enabling Instruction-Level PIM Offloading in Graph Computing Frameworks. | 2017 |
1 | Near-Ideal Networks-on-Chip for Servers. | 2017 |
0 | The Future of Computer Architecture Research: An Industrial Perspective. | 2005 |
0 | Industrial Perspectives: The Next Roadblocks in SOC Evolution: On-Chip Storage Capacity and Off-Chip Bandwidth. | 2006 |
0 | Industrial Perspectives: System IO Network Evolution - Closing Requirement Gaps. | 2006 |
0 | New architectures for a new biology. | 2006 |
0 | Intel’s Tera-scale Computing Project: The first five years, the next five years. | 2008 |
0 | Compilers and parallel computing systems. | 2008 |
0 | Industrial perspectives panel. | 2009 |
0 | Multi-core demands multi-interfaces. | 2009 |
0 | Is hardware innovation over? | 2010 |
0 | Extreme scale computing: Challenges and opportunities. | 2010 |
0 | How’s the parallel computing revolution going? | 2011 |
0 | Improving in-memory database index performance with Intel® Transactional Synchronization Extensions. | 2014 |
0 | Run-time monitoring with adjustable overhead using dataflow-guided filtering. | 2015 |
0 | Design and implementation of a mobile storage leveraging the DRAM interface. | 2016 |
0 | SCsafe: Logging sequential consistency violations continuously and precisely. | 2016 |
0 | PABST: Proportionally Allocated Bandwidth at the Source and Target. | 2017 |
0 | Near-Optimal Access Partitioning for Memory Hierarchies with Multiple Heterogeneous Bandwidth Sources. | 2017 |
0 | BRAVO: Balanced Reliability-Aware Voltage Optimization. | 2017 |
0 | Hipster: Hybrid Task Manager for Latency-Critical Cloud Workloads. | 2017 |
0 | Designing Low-Power, Low-Latency Networks-on-Chip by Optimally Combining Electrical and Optical Links. | 2017 |
0 | Design and Analysis of an APU for Exascale Computing. | 2017 |
0 | Boomerang: A Metadata-Free Architecture for Control Flow Delivery. | 2017 |
0 | Partial Row Activation for Low-Power DRAM System. | 2017 |
0 | High-Bandwidth Low-Latency Approximate Interconnection Networks. | 2017 |
0 | Efficient Sequential Consistency in GPUs via Relativistic Cache Coherence. | 2017 |
0 | Static Bubble: A Framework for Deadlock-Free Irregular On-chip Topologies. | 2017 |
0 | Cooperative Path-ORAM for Effective Memory Bandwidth Sharing in Server Settings. | 2017 |
0 | Camouflage: Memory Traffic Shaping to Mitigate Timing Attacks. | 2017 |
0 | Cold Boot Attacks are Still Hot: Security Analysis of Memory Scramblers in Modern Processors. | 2017 |
0 | Balancing Performance and Lifetime of MLC PCM by Using a Region Retention Monitor. | 2017 |
0 | Architecting an Energy-Efficient DRAM System for GPUs. | 2017 |
0 | Processing-in-Memory Enabled Graphics Processors for 3D Rendering. | 2017 |
0 | Design and Evaluation of AWGR-Based Photonic NoC Architectures for 2.5D Integrated High Performance Computing Systems. | 2017 |
0 | Defect Analysis and Cost-Effective Resilience Architecture for Future DRAM Devices. | 2017 |
0 | Random Folded Clos Topologies for Datacenter Networks. | 2017 |
0 | Radiation-Induced Error Criticality in Modern HPC Parallel Accelerators. | 2017 |
0 | Enabling Effective Module-Oblivious Power Gating for Embedded Processors. | 2017 |
0 | Fast Decentralized Power Capping for Server Clusters. | 2017 |
0 | Maximizing Cache Performance Under Uncertainty. | 2017 |
0 | Towards Pervasive and User Satisfactory CNN across GPU Microarchitectures. | 2017 |
0 | Supporting Address Translation for Accelerator-Centric Architectures. | 2017 |
0 | G-Scalar: Cost-Effective Generalized Scalar Execution Architecture for Power-Efficient GPUs. | 2017 |
0 | NCAP: Network-Driven, Packet Context-Aware Power Management for Client-Server Architecture. | 2017 |
0 | Fast and Accurate Exploration of Multi-level Caches Using Hierarchical Reuse Distance. | 2017 |
0 | Application-Specific Performance-Aware Energy Optimization on Android Mobile Devices. | 2017 |
0 | Pilot Register File: Energy Efficient Partitioned Register File for GPUs. | 2017 |
0 | FlexFlow: A Flexible Dataflow Accelerator Architecture for Convolutional Neural Networks. | 2017 |
0 | Reliability-Aware Scheduling on Heterogeneous Multicore Processors. | 2017 |
0 | KAML: A Flexible, High-Performance Key-Value SSD. | 2017 |
0 | A Split Cache Hierarchy for Enabling Data-Oriented Optimizations. | 2017 |
0 | Understanding and Optimizing Power Consumption in Memory Networks. | 2017 |
0 | SOUP-N-SALAD: Allocation-Oblivious Access Latency Reduction with Asymmetric DRAM Microarchitectures. | 2017 |
0 | Tiny Directory: Efficient Shared Memory in Many-Core Systems with Ultra-Low-Overhead Coherence Tracking. | 2017 |
2017
Cited by | Paper title |
---|---|
97 | Compute Caches. |
4 | SoftMC: A Flexible and Practical Open-Source Infrastructure for Enabling Experimental DRAM Studies. |
4 | Vulnerabilities in MLC NAND Flash Memory Programming: Experimental Analysis, Exploits, and Mitigation Techniques. |
3 | Transparent and Efficient CFI Enforcement with Intel Processor Trace. |
2 | PipeLayer: A Pipelined ReRAM-Based Accelerator for Deep Learning. |
1 | Cooper: Task Colocation with Cooperative Games. |
1 | Secure Dynamic Memory Scheduling Against Timing Channel Attacks. |
1 | Controlled Kernel Launch for Dynamic Parallelism in GPUs. |
1 | Exploring Hyperdimensional Associative Memory. |
1 | SILC-FM: Subblocked InterLeaved Cache-Like Flat Memory Organization. |
1 | ATOM: Atomic Durability in Non-volatile Memory through Hardware Logging. |
1 | MemPod: A Clustered Architecture for Efficient and Scalable Migration in Flat Address Space Multi-level Memories. |
1 | Needle: Leveraging Program Analysis to Analyze and Extract Accelerators from Whole Programs. |
1 | Dynamic GPGPU Power Management Using Adaptive Model Predictive Control. |
1 | SWAP: Effective Fine-Grain Management of Shared Last-Level Caches with Minimum Hardware Support. |
1 | GraphPIM: Enabling Instruction-Level PIM Offloading in Graph Computing Frameworks. |
1 | Near-Ideal Networks-on-Chip for Servers. |
0 | PABST: Proportionally Allocated Bandwidth at the Source and Target. |
0 | Near-Optimal Access Partitioning for Memory Hierarchies with Multiple Heterogeneous Bandwidth Sources. |
0 | BRAVO: Balanced Reliability-Aware Voltage Optimization. |
0 | Hipster: Hybrid Task Manager for Latency-Critical Cloud Workloads. |
0 | Designing Low-Power, Low-Latency Networks-on-Chip by Optimally Combining Electrical and Optical Links. |
0 | Design and Analysis of an APU for Exascale Computing. |
0 | Boomerang: A Metadata-Free Architecture for Control Flow Delivery. |
0 | Partial Row Activation for Low-Power DRAM System. |
0 | High-Bandwidth Low-Latency Approximate Interconnection Networks. |
0 | Efficient Sequential Consistency in GPUs via Relativistic Cache Coherence. |
0 | Static Bubble: A Framework for Deadlock-Free Irregular On-chip Topologies. |
0 | Cooperative Path-ORAM for Effective Memory Bandwidth Sharing in Server Settings. |
0 | Camouflage: Memory Traffic Shaping to Mitigate Timing Attacks. |
0 | Cold Boot Attacks are Still Hot: Security Analysis of Memory Scramblers in Modern Processors. |
0 | Balancing Performance and Lifetime of MLC PCM by Using a Region Retention Monitor. |
0 | Architecting an Energy-Efficient DRAM System for GPUs. |
0 | Processing-in-Memory Enabled Graphics Processors for 3D Rendering. |
0 | Design and Evaluation of AWGR-Based Photonic NoC Architectures for 2.5D Integrated High Performance Computing Systems. |
0 | Defect Analysis and Cost-Effective Resilience Architecture for Future DRAM Devices. |
0 | Random Folded Clos Topologies for Datacenter Networks. |
0 | Radiation-Induced Error Criticality in Modern HPC Parallel Accelerators. |
0 | Enabling Effective Module-Oblivious Power Gating for Embedded Processors. |
0 | Fast Decentralized Power Capping for Server Clusters. |
0 | Maximizing Cache Performance Under Uncertainty. |
0 | Towards Pervasive and User Satisfactory CNN across GPU Microarchitectures. |
0 | Supporting Address Translation for Accelerator-Centric Architectures. |
0 | G-Scalar: Cost-Effective Generalized Scalar Execution Architecture for Power-Efficient GPUs. |
0 | NCAP: Network-Driven, Packet Context-Aware Power Management for Client-Server Architecture. |
0 | Fast and Accurate Exploration of Multi-level Caches Using Hierarchical Reuse Distance. |
0 | Application-Specific Performance-Aware Energy Optimization on Android Mobile Devices. |
0 | Pilot Register File: Energy Efficient Partitioned Register File for GPUs. |
0 | FlexFlow: A Flexible Dataflow Accelerator Architecture for Convolutional Neural Networks. |
0 | Reliability-Aware Scheduling on Heterogeneous Multicore Processors. |
0 | KAML: A Flexible, High-Performance Key-Value SSD. |
0 | A Split Cache Hierarchy for Enabling Data-Oriented Optimizations. |
0 | Understanding and Optimizing Power Consumption in Memory Networks. |
0 | SOUP-N-SALAD: Allocation-Oblivious Access Latency Reduction with Asymmetric DRAM Microarchitectures. |
0 | Tiny Directory: Efficient Shared Memory in Many-Core Systems with Ultra-Low-Overhead Coherence Tracking. |
2016
Cited by | Paper title |
---|---|
25 | CATalyst: Defeating last-level cache side channel attacks in cloud computing. |
23 | ChargeCache: Reducing DRAM latency by exploiting row access locality. |
22 | Mobile CPU’s rise to power: Quantifying the impact of generational mobile CPU design trends on performance, energy, and user satisfaction. |
19 | Low-Cost Inter-Linked Subarrays (LISA): Enabling fast inter-subarray data movement in DRAM. |
13 | Memristive Boltzmann machine: A hardware accelerator for combinatorial optimization and deep learning. |
11 | TABLA: A unified template-based framework for accelerating statistical machine learning. |
10 | Revisiting virtual L1 caches: A practical design using dynamic synonym remapping. |
10 | Energy-efficient address translation. |
9 | Simultaneous Multikernel GPU: Multi-tasking throughput processors via fine-grained sharing. |
9 | A performance analysis framework for optimizing OpenCL applications on FPGAs. |
9 | HRL: Efficient and flexible reconfigurable logic for near-data processing. |
8 | Warped-preexecution: A GPU pre-execution approach for improving latency hiding. |
8 | A case for toggle-aware compression for GPU systems. |
7 | Cache QoS: From concept to reality in the Intel® Xeon® processor E5-2600 v3 product family. |
7 | A large-scale study of soft-errors on GPUs in the field. |
7 | Atomic persistence for SCM with a non-intrusive backend controller. |
6 | Selective GPU caches to eliminate CPU-GPU HW cache coherence. |
5 | Efficient footprint caching for Tagless DRAM Caches. |
5 | A complete key recovery timing attack on a GPU. |
5 | McVerSi: A test generation framework for fast memory consistency verification in simulation. |
5 | Pushing the limits of accelerator efficiency while retaining programmability. |
5 | Lattice priority scheduling: Low-overhead timing-channel protection for a shared memory controller. |
5 | Restore truncation for performance improvement in future DRAM systems. |
5 | Modeling cache performance beyond LRU. |
5 | SLaC: Stage laser control for a flattened butterfly network. |
4 | CompEx: Compression-expansion coding for energy, latency, and lifetime improvements in MLC/TLC NVM. |
4 | ScalCore: Designing a core for voltage scalability. |
4 | Best-offset hardware prefetching. |
4 | Predicting the memory bandwidth and optimal core allocations for multi-threaded applications on large-scale NUMA machines. |
4 | Towards high performance paged memory for GPUs. |
3 | Improving smartphone user experience by balancing performance and energy with probabilistic QoS guarantee. |
3 | LASER: Light, Accurate Sharing dEtection and Repair. |
3 | A low power software-defined-radio baseband processor for the Internet of Things. |
3 | Parity Helix: Efficient protection for single-dimensional faults in multi-dimensional memory systems. |
3 | Symbiotic job scheduling on the IBM POWER8. |
3 | MaPU: A novel mathematical computing architecture. |
2 | Approximating warps with intra-warp operand value similarity. |
2 | Software transparent dynamic binary translation for coarse-grain reconfigurable architectures. |
2 | Amdahl’s law for lifetime reliability scaling in heterogeneous multicore processors. |
2 | Cost effective physical register sharing. |
2 | A low-power hybrid reconfigurable architecture for resistive random-access memories. |
2 | LiveSim: Going live with microarchitecture simulation. |
2 | Core tunneling: Variation-aware voltage noise mitigation in GPUs. |
2 | Venice: Exploring server architectures for effective resource sharing. |
1 | DVFS for NoCs in CMPs: A thread voting approach. |
1 | DUANG: Fast and lightweight page migration in asymmetric memory systems. |
1 | PleaseTM: Enabling transaction conflict management in requester-wins hardware transactional memory. |
1 | Minimal disturbance placement and promotion. |
1 | iPAWS: Instruction-issue pattern-based adaptive warp scheduling for GPGPUs. |
1 | Efficient synthetic traffic models for large, complex SoCs. |
1 | Efficient GPU hardware transactional memory through early conflict resolution. |
1 | The runahead network-on-chip. |
1 | RADAR: Runtime-assisted dead region management for last-level caches. |
1 | SizeCap: Efficiently handling power surges in fuel cell powered data centers. |
1 | A market approach for handling power emergencies in multi-tenant data center. |
0 | Design and implementation of a mobile storage leveraging the DRAM interface. |
0 | SCsafe: Logging sequential consistency violations continuously and precisely. |
2015
Cited by | Paper title |
---|---|
52 | Overcoming the challenges of crossbar resistive memory architectures. |
50 | Adaptive-latency DRAM: Optimizing DRAM timing for the common-case. |
49 | Architecture exploration for ambient energy harvesting nonvolatile processors. |
48 | SNNAP: Approximate computing on programmable SoCs via neural acceleration. |
43 | NDA: Near-DRAM acceleration architecture leveraging commodity DRAM devices and standard memory modules. |
41 | Data retention in MLC NAND flash memory: Characterization, optimization, and recovery. |
40 | Heterogeneous memory architectures: A HW/SW approach for mixing die-stacked and off-package memories. |
35 | GPGPU performance and power estimation using machine learning. |
34 | Event-based scheduling for energy-efficient QoS (eQoS) in mobile Web applications. |
30 | Adrenaline: Pinpointing and reining in tail queries with quick voltage boosting. |
26 | Understanding GPU errors on large-scale HPC systems and the implications for system design and operation. |
26 | Exploiting compressed block size as an indicator of future reuse. |
26 | Coordinated static and dynamic cache bypassing for GPUs. |
26 | Bamboo ECC: Strong, safe, and flexible codes for reliable computer memory. |
25 | Mascar: Speeding up GPU warps by reducing memory pitstops. |
24 | Quantifying sources of error in McPAT and potential impacts on architectural studies. |
24 | Power punch: Towards non-blocking power-gating of NoC routers. |
21 | Octopus-Man: QoS-driven task management for heterogeneous multicores in warehouse-scale computers. |
20 | XChange: A market-based approach to scalable dynamic multi-resource allocation in multicore architectures. |
19 | Priority-based cache allocation in throughput processors. |
19 | Prediction-based superpage-friendly TLB designs. |
19 | Unlocking bandwidth for GPUs in CC-NUMA systems. |
18 | Supporting superpages in non-contiguous physical memory. |
18 | BRAINIAC: Bringing reliable accuracy into neurally-implemented approximate computing. |
17 | Talus: A simple way to remove cliffs in cache performance. |
17 | Hierarchical private/shared classification: The key to simple and efficient coherence for clustered cache hierarchies. |
16 | Tag tables. |
15 | CAFO: Cost aware flip optimization for asymmetric memories. |
15 | Malware-aware processors: A framework for efficient online malware detection. |
14 | Understanding contention-based channels and using them for defense. |
13 | Domain knowledge based energy management in handhelds. |
13 | Paying to save: Reducing cost of colocation data center via rewards. |
12 | Scaling distributed cache hierarchies through computation and data co-scheduling. |
11 | Augmenting low-latency HPC network with free-space optical links. |
10 | Overcoming far-end congestion in large-scale networks. |
9 | SCOC: High-radix switches made of bufferless clos networks. |
9 | FTXen: Making hypervisor resilient to hardware faults on relaxed cores. |
8 | Reducing read latency of phase change memory via early read and Turbo Read. |
7 | Exploring architectural heterogeneity in intelligent vision systems. |
7 | GPU voltage noise: Characterization and hierarchical smoothing of spatial and temporal voltage noise interference in GPU architectures. |
7 | BeBoP: A cost effective predictor infrastructure for superscalar value prediction. |
7 | Understanding idle behavior and power gating mechanisms in the context of modern benchmarks on CPU-GPU Integrated systems. |
6 | iPatch: Intelligent fault patching to improve energy efficiency. |
6 | Flask coherence: A morphable hybrid coherence protocol to balance energy, performance and scalability. |
6 | Balancing reliability, cost, and performance tradeoffs with FreeFault. |
5 | High performing cache hierarchies for server workloads: Relaxing inclusion to capture the latency benefits of exclusive caches. |
5 | Increasing multicore system efficiency through intelligent bandwidth shifting. |
5 | Understanding the virtualization “Tax” of scale-out pass-through GPUs in GaaS clouds: An empirical study. |
5 | CiDRA: A cache-inspired DRAM resilience architecture. |
5 | Scalable communication architecture for network-attached accelerators. |
5 | VSR sort: A novel vectorised sorting algorithm & architecture extensions for future microprocessors. |
4 | Correction prediction: Reducing error correction latency for on-chip memories. |
2 | Studying the impact of multicore processor scaling on directory techniques via reuse distance analysis. |
2 | Alloy: Parallel-serial memory channel architecture for single-chip heterogeneous processor systems. |
0 | Run-time monitoring with adjustable overhead using dataflow-guided filtering. |
2014
Cited by | Paper title |
---|---|
206 | BigDataBench: A big data benchmark suite from internet services. |
73 | MRPB: Memory request prioritization for massively parallel processors. |
68 | Improving DRAM performance by parallelizing refreshes with accesses. |
56 | Improving GPGPU resource utilization through alternative thread block scheduling. |
46 | Adaptive placement and migration policy for an STT-RAM-based hybrid cache. |
44 | A detailed GPU cache model based on reuse distance theory. |
41 | DASCA: Dead Write Prediction Assisted STT-RAM Cache Architecture. |
41 | Suppressing the Oblivious RAM timing channel while making information leakage and program efficiency trade-offs. |
40 | Improving cache performance using read-write partitioning. |
40 | MemZip: Exploring unconventional benefits from memory compression. |
37 | Timing channel protection for a shared memory controller. |
37 | Supporting x86-64 address translation for 100s of GPU lanes. |
34 | QuickRelease: A throughput-oriented approach to release consistency on GPUs. |
33 | Increasing TLB reach by exploiting clustering in page translations. |
29 | Mosaic: Exploiting the spatial locality of process variation to reduce refresh energy in on-chip eDRAM modules. |
29 | Improving system throughput and fairness simultaneously in shared memory CMP systems via Dynamic Bank Partitioning. |
29 | Sandbox Prefetching: Safe run-time evaluation of aggressive prefetchers. |
29 | Warp-level divergence in GPUs: Characterization, impact, and mitigation. |
28 | Dynamic management of TurboMode in modern multi-core chips. |
28 | TSO-CC: Consistency directed cache coherence for TSO. |
26 | Exploiting thermal energy storage to reduce data center capital and operating expenses. |
25 | MP3: Minimizing performance penalty for power-gating of Clos network-on-chip. |
24 | NUAT: A non-uniform access time memory controller. |
23 | QORE: A fault tolerant network-on-chip architecture with power-efficient quad-function channel (QFC) buffers. |
19 | Reducing the cost of persistence for nonvolatile heaps in end user devices. |
19 | Concurrent and consistent virtual machine introspection with hardware transactional memory. |
19 | CREAM: A Concurrent-Refresh-Aware DRAM Memory architecture. |
19 | Stash directory: A scalable directory for many-core coherence. |
18 | Sprinkler: Maximizing resource utilization in many-chip solid state disks. |
18 | PVCoherence: Designing flat coherence protocols for scalable verification. |
17 | Locality-aware data replication in the Last-Level Cache. |
17 | Spare register aware prefetching for graph algorithms on GPUs. |
17 | Implications of high energy proportional servers on cluster-wide energy proportionality. |
17 | Practical data value speculation for future high-end processors. |
15 | Tangle: Route-oriented dynamic voltage minimization for variation-afflicted, energy-efficient on-chip networks. |
15 | A scalable multi-path microarchitecture for efficient GPU control flow. |
14 | Precision-aware soft error protection for GPUs. |
14 | Revolver: Processor architecture for power efficient loop execution. |
13 | Undersubscribed threading on clustered cache architectures. |
12 | Dynamically detecting and tolerating IF-Condition Data Races. |
12 | DraMon: Predicting memory bandwidth usage of multi-threaded programs with high accuracy and low overhead. |
12 | Up by their bootstraps: Online learning in Artificial Neural Networks for CMP uncore power management. |
11 | Accordion: Toward soft Near-Threshold Voltage Computing. |
11 | 3D stacking of high-performance processors. |
10 | STM: Cloning the spatial and temporal memory access behavior. |
10 | Strategies for anticipating risk in heterogeneous system design. |
8 | A Non-Inclusive Memory Permissions architecture for protection against cross-layer attacks. |
7 | Understanding the impact of gate-level physical reliability effects on whole program execution. |
7 | Atomic SC for simple in-order processors. |
7 | Transportation-network-inspired network-on-chip. |
7 | FADE: A programmable filtering accelerator for instruction-grain monitoring. |
5 | Over-clocked SSD: Safely running beyond flash memory chip I/O clock specs. |
5 | CDTT: Compiler-generated data-triggered threads. |
5 | Scalably verifiable dynamic power management. |
5 | GPUdmm: A high-performance and memory-oblivious GPU architecture using dynamic memory management. |
2 | Accelerating decoupled look-ahead via weak dependence removal: A metaheuristic approach. |
1 | Low-overhead and high coverage run-time race detection through selective meta-data management. |
0 | Improving in-memory database index performance with Intel® Transactional Synchronization Extensions. |
2013¶
Cited by | Paper title |
---|---|
130 | Technology comparison for large last-level caches (L3Cs): Low-leakage SRAM, low write-energy STT-RAM, and refresh-optimized eDRAM. |
107 | Power struggles: Revisiting the RISC vs. CISC debate on contemporary ARM and x86 architectures. |
103 | Tiered-latency DRAM: A low latency and low cost DRAM architecture. |
80 | MISE: Providing performance predictability and improving fairness in shared main memory systems. |
80 | Cache coherence for GPU architectures. |
79 | High-performance and energy-efficient mobile web browsing on big/little systems. |
76 | ESESC: A fast multicore simulator using Time-Based Sampling. |
72 | Runnemede: An architecture for Ubiquitous High-Performance Computing. |
72 | Accelerating write by exploiting PCM asymmetries. |
65 | Warped register file: A power efficient register file for GPGPUs. |
59 | Reducing GPU offload latency via fine-grained CPU-GPU synchronization. |
55 | Application-to-core mapping policies to reduce memory system interference in multi-core systems. |
53 | Navigating heterogeneous processors with market mechanisms. |
51 | Architecture support for guest-transparent VM protection from untrusted hypervisor and physical attacks. |
50 | A case for Refresh Pausing in DRAM memory systems. |
50 | Breaking the on-chip latency barrier using SMART. |
49 | Energy-efficient interconnect via Router Parking. |
49 | Power-performance co-optimization of throughput core architecture using resistive memory. |
47 | Optimizing virtual machine scheduling in NUMA multicore systems. |
47 | i2WAP: Improving non-volatile cache lifetime by reducing inter- and intra-set write variations. |
44 | Enabling distributed generation powered sustainable high-performance data center. |
43 | Coset coding to extend the lifetime of memory. |
41 | EnergySmart: Toward energy-efficient manycores for Near-Threshold Computing. |
39 | The dual-path execution model for efficient GPU control flow. |
36 | Optimizing Google’s warehouse scale computers: The NUMA experience. |
36 | Sonic Millip3De: A massively parallel 3D-stacked accelerator for 3D ultrasound. |
34 | Disintegrated control for energy-efficient and heterogeneous memory systems. |
33 | Power-efficient computing for compute-intensive GPGPU applications. |
32 | Refrint: Intelligent refresh to minimize power in on-chip multiprocessor cache hierarchies. |
29 | SCRAP: Architecture for signature-based protection from Code Reuse Attacks. |
28 | Modeling performance variation due to cache sharing. |
28 | Scaling towards kilo-core processors with asymmetric high-radix topologies. |
24 | Layout-conscious random topologies for HPC off-chip interconnects. |
24 | ECM: Effective Capacity Maximizer for high-performance compressed caching. |
20 | Cost effective data center servers. |
18 | Improving multi-core performance using mixed-cell cache architecture. |
18 | Worm-Bubble Flow Control. |
15 | Exploring high-performance and energy proportional interface for phase change memory systems. |
14 | Adaptive Reliability Chipkill Correct (ARCC). |
12 | TS-Router: On maximizing the Quality-of-Allocation in the On-Chip Network. |
11 | Skinflint DRAM system: Minimizing DRAM chip writes for low power. |
11 | Macho: A failure model-oriented adaptive cache architecture to enable near-threshold voltage scaling. |
10 | Illusionist: Transforming lightweight cores into aggressive cores on demand. |
10 | Store-Load-Branch (SLB) predictor: A compiler assisted branch prediction for data dependent branches. |
9 | Rainbow: Efficient memory dependence recording with high replay parallelism for relaxed memory model. |
9 | RECAP: A region-based cure for the common cold (cache). |
8 | Bridging the semantic gap: Emulating biological neuronal behaviors with simple digital neurons. |
7 | A novel system architecture for web scale applications using lightweight CPUs and virtualized I/O. |
7 | A multiple SIMD, multiple data (MSMD) architecture: Parallel execution of dynamic and static SIMD fragments. |
7 | Two level bulk preload branch prediction. |
7 | High-speed formal verification of heterogeneous coherence hierarchies. |
6 | In-network traffic regulation for Transactional Memory. |
5 | A group-commit mechanism for ROB-based processors implementing the X86 ISA. |
4 | How to implement effective prediction and forwarding for fusable dynamic multicore architectures. |
1 | Hybrid latency tolerance for robust energy-efficiency on 1000-core data parallel processors. |
2012¶
Cited by | Paper title |
---|---|
117 | Computational sprinting. |
109 | Improving write operations in MLC phase change memory. |
102 | Balancing DRAM locality and parallelism in shared memory CMP systems. |
92 | The case for GPGPU spatial multitasking. |
88 | SCD: A scalable coherence directory with flexible sharer set encoding. |
88 | TAP: A TLP-aware cache management policy for a CPU-GPU heterogeneous architecture. |
79 | CPU-assisted GPGPU on fused CPU-GPU architectures. |
60 | Whole packet forwarding: Efficient design of fully adaptive routing algorithms for networks-on-chip. |
58 | Efficient scrub mechanisms for error-prone emerging memories. |
54 | Booster: Reactive core acceleration for mitigating the effects of process variation and application imbalance in low-voltage chips. |
48 | QuickIA: Exploring heterogeneous architectures on real prototypes. |
47 | Cooperative partitioning: Energy-efficient cache partitioning for high-performance CMPs. |
44 | Dynamically heterogeneous cores through 3D resource pooling. |
42 | Design, integration and implementation of the DySER hardware accelerator into OpenSPARC. |
41 | Quasi-nonvolatile SSD: Trading flash memory nonvolatility to improve storage system performance for enterprise applications. |
41 | Staged Reads: Mitigating the impact of DRAM writes on DRAM reads. |
39 | AgileRegulator: A hybrid voltage regulator scheme redeeming dark silicon for power efficiency in a multicore architecture. |
37 | System-level implications of disaggregated memory. |
28 | MORSE: Multi-objective reconfigurable self-optimizing memory scheduler. |
26 | JETC: Joint energy thermal and cooling management for memory and CPU subsystems in servers. |
26 | Statistical performance comparisons of computers. |
24 | π-TM: Pessimistic invalidation for scalable lazy hardware transactional memory. |
22 | Power balanced pipelines. |
22 | Network congestion avoidance through Speculative Reservation. |
21 | Decoupled dynamic cache segmentation. |
20 | MACAU: A Markov model for reliability evaluations of caches under Single-bit and Multi-bit Upsets. |
18 | WEST: Cloning data cache behavior using Stochastic Traces. |
18 | Supporting efficient collective communication in NoCs. |
18 | Pacman: Tolerating asymmetric data races with unintrusive hardware. |
16 | Network within a network approach to create a scalable high-radix router microarchitecture. |
15 | Parabix: Boosting the efficiency of text processing on commodity processors. |
15 | Cache restoration for highly partitioned virtualized systems. |
13 | Accelerating business analytics applications. |
12 | Adaptive Set-Granular Cooperative Caching. |
10 | BulkSMT: Designing SMT processors for atomic-block execution. |
7 | Architectural support for synchronization-free deterministic parallel programming. |
6 | Flexible register management using reference counting. |
4 | BulkCompactor: Optimized deterministic execution via Conflict-Aware commit of atomic blocks. |
4 | Architectural perspectives of future wireless base stations based on the IBM PowerEN™ processor. |
2011¶
Cited by | Paper title |
---|---|
262 | Relaxing non-volatility for fast and energy-efficient STT-RAM caches. |
229 | A quantitative performance analysis model for GPU architectures. |
177 | Dynamically Specialized Datapaths for energy efficient computing. |
166 | Essential roles of exploiting internal parallelism of flash memory based solid state drives in high-speed data processing. |
154 | Thread block compaction for efficient SIMT control flow. |
140 | I-CASH: Intelligently Coupled Array of SSD and HDD. |
136 | FREE-p: Protecting non-volatile memory against both hard and soft errors. |
131 | CHIPPER: A low-complexity bufferless deflection router. |
110 | Cuckoo directory: A scalable directory for many-core systems. |
110 | Beyond block I/O: Rethinking traditional storage primitives. |
76 | SolarCore: Solar energy driven multi-core architecture power management. |
75 | HAsim: FPGA-based high-detail multicore simulation using time-division multiplexing. |
73 | Calvin: Deterministic or not? Free will to choose. |
72 | Shared last-level TLBs for chip multiprocessors. |
68 | CloudCache: Expanding and shrinking private caches. |
62 | Addressing system-level trimming issues in on-chip nanophotonic networks. |
62 | A case for guarded power gating for multi-core processors. |
61 | Mercury: A fast and energy-efficient multi-level cell based Phase Change Memory system. |
60 | Practical and secure PCM systems by online detection of malicious write streams. |
56 | Programming the cloud. |
54 | Archipelago: A polymorphic cache design for enabling robust near-threshold operation. |
48 | NUcache: An efficient multicore cache organization based on Next-Use distance. |
45 | A new server I/O architecture for high speed networks. |
44 | Atomic Coherence: Leveraging nanophotonics to build race-free cache coherence protocols. |
44 | Fast thread migration via cache working set prediction. |
41 | MorphCache: A Reconfigurable Adaptive Multi-level Cache hierarchy. |
39 | Dynamic parallelization of JavaScript applications using an ultra-lightweight speculation mechanism. |
37 | Efficient complex operators for irregular codes. |
34 | Abstraction and microarchitecture scaling in early-stage power modeling. |
32 | Bloom Filter Guided Transaction Scheduling. |
30 | Hardware/software techniques for DRAM thermal management. |
30 | Achieving uniform performance and maximizing throughput in the presence of heterogeneity. |
30 | Efficient data streaming with on-chip accelerators: Opportunities and challenges. |
27 | ACCESS: Smart scheduling for asymmetric cache CMPs. |
27 | Low-voltage on-chip cache architecture using heterogeneous cell sizes for high-performance processors. |
26 | Power shifting in Thrifty Interconnection Network. |
25 | Data-triggered threads: Eliminating redundant computation. |
22 | Storage free confidence estimation for the TAGE branch predictor. |
20 | Offline symbolic analysis to infer Total Store Order. |
19 | HAQu: Hardware-accelerated queueing for fine-grained threading on a chip multiprocessor. |
18 | Checked Load: Architectural support for JavaScript type-checking on mobile processors. |
9 | Exploiting criticality to reduce bottlenecks in distributed uniprocessors. |
8 | Hardware/software-based diagnosis of load-store queues using expandable activity logs. |
5 | Fg-STP: Fine-Grain Single Thread Partitioning on Multicores. |
5 | Architectural framework for supporting operating system survivability. |
3 | MOPED: Orchestrating interprocess message data on CMPs. |
3 | Safe and efficient supervised memory systems. |
0 | How’s the parallel computing revolution going? |
2010¶
Cited by | Paper title |
---|---|
386 | Graphite: A distributed parallel simulator for multicores. |
318 | ATLAS: A scalable and high-performance scheduling algorithm for multiple memory controllers. |
229 | Improving read performance of Phase Change Memories via Write Cancellation and Write Pausing. |
213 | An optimized 3D-stacked memory architecture by exploiting excessive, high-density TSV bandwidth. |
191 | High performance network virtualization with SR-IOV. |
143 | FlexiShare: Channel sharing for an energy-efficient nanophotonic crossbar. |
131 | Operating system support for overlapping-ISA heterogeneous multi-core architectures. |
120 | Application performance modeling in a virtualized environment. |
115 | Scalable architectural support for trusted software. |
112 | A Hybrid solid-state storage architecture for the performance, energy consumption, and lifetime improvement. |
103 | Designing a processor from the ground up to allow voltage/reliability tradeoffs. |
91 | Interval simulation: Raising the level of abstraction in architectural simulation. |
90 | CHOP: Adaptive filter-based DRAM caching for CMP server platforms. |
83 | Towards scalable, energy-efficient, bus-based on-chip networks. |
81 | Understanding how off-chip memory bandwidth partitioning in Chip Multiprocessors affects system performance. |
59 | Worth their watts? - an empirical study of datacenter servers. |
42 | Simple virtual channel allocation for high throughput and high frequency on-chip routers. |
39 | ESP-NUCA: A low-cost adaptive Non-Uniform Cache Architecture. |
32 | A bandwidth-aware memory-subsystem resource management using non-invasive resource profilers for large CMP systems. |
30 | UNified Instruction/Translation/Data (UNITD) coherence: One protocol to rule them all. |
29 | DMA cache: Using on-chip storage to architecturally separate I/O data from CPU data for improving I/O performance. |
28 | Explaining cache SER anomaly using DUE AVF measurement. |
24 | BOLT: Energy-efficient Out-of-Order Latency-Tolerant execution. |
22 | Value Based BTB Indexing for indirect jump prediction. |
19 | IADVS: On-demand performance for interactive applications. |
17 | LiteTM: Reducing transactional state overhead. |
16 | SIF: Overcoming the limitations of SIMD devices via implicit permutation. |
16 | StimulusCache: Boosting performance of chip multiprocessors with excess cache. |
16 | Delay-Hiding energy management mechanisms for DRAM. |
13 | Exascale computing: The challenges and opportunities in the next decade. |
10 | COMIC++: A software SVM system for heterogeneous multicore accelerator clusters. |
8 | Handling branches in TLS systems with Multi-Path Execution. |
6 | High-Performance low-vcc in-order core. |
5 | HARE: Hardware assisted reverse execution. |
5 | DMA++: on the fly data realignment for on-chip memories. |
4 | LeadOut: Composing low-overhead frequency-enhancing techniques for single-thread performance in configurable multicores. |
1 | Architecting for power management: The IBM POWER7™ approach. |
0 | Is hardware innovation over? |
0 | Extreme scale computing: Challenges and opportunities. |
2009¶
Cited by | Paper title |
---|---|
348 | A novel architecture of the 3D stacked MRAM L2 cache for CMPs. |
183 | Design and evaluation of a hierarchical on-chip interconnect for next-generation CMPs. |
170 | Express Cube Topologies for on-Chip Interconnects. |
123 | Adaptive Spill-Receive for robust high-performance caching in CMPs. |
119 | Variation-aware dynamic voltage/frequency scaling. |
106 | Elastic-buffer flow control for on-chip networks. |
95 | Eliminating microarchitectural dependency from Architectural Vulnerability. |
95 | Dynamic hardware-assisted software-controlled page placement to manage capacity allocation and sharing within large caches. |
94 | Prediction router: Yet another low latency on-chip router architecture. |
89 | PageNUCA: Selected policies for page-grain locality management in large shared chip-multiprocessor caches. |
89 | Accurate microarchitecture-level fault modeling for studying hardware faults. |
88 | Techniques for bandwidth-efficient prefetching of linked data structures in hybrid prefetching systems. |
86 | A low-radix and low-diameter 3D interconnection network design. |
86 | CAMP: A technique to estimate per-structure power at run-time using a few simple parameters. |
84 | Blueshift: Designing processors for timing speculation from the ground up. |
76 | A first-order fine-grained multithreaded throughput model. |
73 | Bridging the computation gap between programmable processors and hardwired accelerators. |
66 | Voltage emergency prediction: Using signatures to reduce operating margins. |
65 | Optimizing communication and capacity in a 3D stacked reconfigurable cache hierarchy. |
65 | In-Network Snoop Ordering (INSO): Snoopy coherence on unordered interconnects. |
63 | Hardware-software integrated approaches to defend against software cache-based side channel attacks. |
56 | MRR: Enabling fully adaptive multicast routing for CMP interconnection networks. |
56 | Versatile prediction and fast estimation of Architectural Vulnerability Factor from processor performance metrics. |
52 | iCFP: Tolerating all-level cache misses in in-order processors. |
49 | Design and implementation of software-managed caches for multicores with local memory. |
47 | Dacota: Post-silicon validation of the memory subsystem in multi-core designs. |
36 | An intelligent IT infrastructure for the future. |
34 | Reconciling specialization and flexibility through compound circuits. |
34 | Fast complete memory consistency verification. |
33 | Characterization of Direct Cache Access on multi-core systems and 10GbE. |
28 | Practical off-chip meta-data for temporal memory streaming. |
23 | Architectural Contesting. |
19 | Feedback mechanisms for improving probabilistic memory prefetching. |
19 | Soft error vulnerability aware process variation mitigation. |
16 | Criticality-based optimizations for efficient load processing. |
3 | Lightweight predication support for out of order processors. |
2 | Opportunities beyond single-core microprocessors. |
0 | Industrial perspectives panel. |
0 | Multi-core demands multi-interfaces. |
2008¶
Cited by | Paper title |
---|---|
1235 | Amdahl’s Law in the multicore era. |
617 | System level analysis of fast, per-core DVFS using on-chip switching regulators. |
336 | Gaining insights into multicore cache partitioning: Bridging the gap between simulation and real systems. |
315 | Regional congestion awareness for load balance in networks-on-chip. |
231 | CMP network-on-chip overlaid with multi-band RF-interconnect. |
211 | Cluster-level feedback power control for performance optimization. |
138 | FlexiTaint: A programmable accelerator for dynamic taint propagation. |
132 | A comprehensive approach to DRAM power management. |
107 | C-Oracle: Predictive thermal management for data centers. |
103 | Uncovering hidden loop level parallelism in sequential applications. |
84 | Performance and power optimization through data compression in Network-on-Chip architectures. |
74 | DeCoR: A Delayed Commit and Rollback mechanism for handling inductive noise in processors. |
65 | An OS-based alternative to full hardware coherence on tiled CMPs. |
64 | Automated microprocessor stressmark generation. |
49 | Design and implementation of the Blue Gene/P snoop filter. |
47 | Thread-safe dynamic binary translation using transactional memory. |
42 | EXCES: External caching in energy saving storage systems. |
38 | Runtime validation of memory ordering using constraint graph checking. |
34 | Supporting highly-decoupled thread-level redundancy for parallel programs. |
31 | Fundamental performance constraints in horizontal fusion of in-order cores. |
27 | Single-level integrity and confidentiality protection for distributed shared memory multiprocessors. |
26 | High-throughput pairwise point interactions in Anton, a specialized machine for molecular dynamics simulation. |
26 | Power-Efficient DRAM Speculation. |
24 | Address-branch correlation: A novel locality for long-latency hard-to-predict branches. |
23 | Incorporating flexibility in Anton, a specialized machine for molecular dynamics simulation. |
21 | Runahead Threads to improve SMT performance. |
20 | Roughness of microarchitectural design topologies and its implications for optimization. |
18 | Prediction of CPU idle-busy activity pattern. |
14 | PEEP: Exploiting predictability of memory dependences in SMT processors. |
12 | PaCo: Probability-based path confidence prediction. |
10 | Serializing instructions in system-intensive workloads: Amdahl’s Law strikes again. |
10 | Speculative instruction validation for performance-reliability trade-off. |
8 | Performance-aware speculation control using wrong path usefulness prediction. |
5 | Fabric convergence implications on systems architecture. |
4 | Branch-mispredict level parallelism (BLP) for control independence. |
0 | Intel’s Tera-scale Computing Project: The first five years, the next five years. |
0 | Compilers and parallel computing systems. |
2007¶
Cited by | Paper title |
---|---|
1022 | Evaluating MapReduce for Multi-core and Multiprocessor Systems. |
373 | LogTM-SE: Decoupling Hardware Transactional Memory from Caches. |
243 | Feedback Directed Prefetching: Improving the Performance and Bandwidth-Efficiency of Hardware Prefetchers. |
190 | Concurrent Direct Network Access for Virtual Machine Monitors. |
159 | Thermal Herding: Microarchitecture Techniques for Controlling Hotspots in High-Performance 3D-Integrated Processors. |
147 | A Scalable, Non-blocking Approach to Transactional Memory. |
144 | Application-Level Correctness and its Impact on Fault Tolerance. |
141 | An Adaptive Shared/Private NUCA Cache Partitioning Scheme for Chip Multiprocessors. |
137 | HARD: Hardware-Assisted Lockset-based Race Detection. |
125 | A Burst Scheduling Access Reordering Mechanism. |
113 | Extending Multicore Architectures to Exploit Hybrid Parallelism in Single-thread Applications. |
105 | Perturbation-based Fault Screening. |
102 | MemTracker: Efficient and Programmable Support for Memory Access Monitoring and Debugging. |
97 | Illustrative Design Space Studies with Microarchitectural Regression Models. |
79 | Fully-Buffered DIMM Memory Architectures: Understanding Mechanisms, Overheads and Scaling. |
65 | A Memory-Level Parallelism Aware Fetch Policy for SMT Processors. |
64 | Interactions Between Compression and Prefetching in Chip Multiprocessors. |
64 | Line Distillation: Increasing Cache Capacity by Filtering Unused Words in Cache Lines. |
63 | An Adaptive Cache Coherence Protocol Optimized for Producer-Consumer Sharing. |
57 | Modeling and Managing Thermal Profiles of Rack-mounted Servers with ThermoStat. |
41 | Colorama: Architectural Support for Data-Centric Synchronization. |
40 | Accelerating and Adapting Precomputation Threads for Efficient Prefetching. |
39 | Liquid SIMD: Abstracting SIMD Hardware using Lightweight Dynamic Mapping. |
38 | A Domain-Specific On-Chip Network Design for Large Scale Cache Systems. |
36 | Error Detection via Online Checking of Cache Coherence with Token Coherence Signatures. |
35 | A Low Overhead Fault Tolerant Coherence Protocol for CMP Architectures. |
25 | Optical Interconnect Opportunities for Future Server Memory Systems. |
24 | Exploiting Postdominance for Speculative Parallelization. |
21 | Improving Branch Prediction and Predicated Execution in Out-of-Order Processors. |
14 | Implications of Device Timing Variability on Full Chip Timing. |
4 | Interconnect-Centric Computing. |
3 | Petascale Computing Research Challenges - A Manycore Perspective. |
2006¶
Cited by | Paper title |
---|---|
770 | LogTM: log-based transactional memory. |
266 | Dynamic power-performance adaptation of parallel computation on chip multiprocessors. |
176 | BulletProof: a defect-tolerant CMP switch architecture. |
169 | CMP design space exploration subject to physical constraints. |
164 | Construction and use of linear regression models for processor performance analysis. |
143 | Last level cache (LLC) performance of data mining workloads on a CMP - a case study of parallel bioinformatics workloads. |
127 | Retention-aware placement in DRAM (RAPID): software methods for quasi-non-volatile DRAM. |
125 | The common case transactional behavior of multithreaded programs. |
112 | Phase characterization for power: evaluating control-flow-based and event-counter-based techniques. |
91 | CORD: cost-effective (and nearly overhead-free) order-recording and data race detection. |
86 | DMA-aware memory energy management. |
86 | Exploiting parallelism and structure to accelerate the simulation of chip multi-processors. |
86 | ReViveI/O: efficient handling of I/O in highly-available rollback-recovery servers. |
77 | Understanding the performance-temperature interactions in disk I/O of server workloads. |
77 | High performance file I/O for the Blue Gene/L supercomputer. |
49 | InfoShield: a security architecture for protecting information usage in memory. |
45 | An approach for implementing efficient superscalar CISC processors. |
33 | A decoupled KILO-instruction processor. |
31 | Reducing resource redundancy for concurrent error detection techniques in high performance microprocessors. |
28 | Increasing the cache efficiency by eliminating noise. |
27 | Efficient instruction schedulers for SMT processors. |
23 | Completely verifying memory consistency of test program executions. |
21 | Software-hardware cooperative memory disambiguation. |
20 | Store vectors for scalable memory dependence prediction and scheduling. |
17 | Probabilistic counter updates for predictor hysteresis and stratification. |
12 | Chip-multiprocessing and beyond. |
5 | Speculative synchronization and thread management for fine granularity threads. |
2 | Industrial Perspectives: Platform Design Challenges with Many cores. |
0 | Industrial Perspectives: The Next Roadblocks in SOC Evolution: On-Chip Storage Capacity and Off-Chip Bandwidth. |
0 | Industrial Perspectives: System IO Network Evolution - Closing Requirement Gaps. |
0 | New architectures for a new biology. |
2005¶
Cited by | Paper title |
---|---|
616 | Unbounded Transactional Memory. |
589 | Predicting Inter-Thread Cache Contention on a Chip Multi-Processor Architecture. |
499 | Power Efficient Processor Architecture and The Cell Processor. |
395 | The Soft Error Problem: An Architectural Perspective. |
198 | Chip Multithreading: Opportunities and Challenges. |
183 | Performance, Energy, and Thermal Considerations for SMT and CMP Architectures. |
177 | SafeMem: Exploiting ECC-Memory for Detecting Memory Leaks and Memory Corruption During Production Runs. |
136 | A New Scalable and Cost-Effective Congestion Management Strategy for Lossless Multistage Interconnection Networks. |
117 | Transition Phase Classification and Prediction. |
111 | Characterizing and Comparing Prevailing Simulation Techniques. |
104 | Improving Multiple-CMP Systems Using Token Coherence. |
91 | A Performance Comparison of DRAM Memory System Optimizations for SMT Processors. |
90 | Checkpointed Early Load Retirement. |
84 | A Unified Compressed Memory Hierarchy. |
81 | Trends in High-Performance Processors. |
78 | Voltage and Frequency Control With Adaptive Reaction Time in Multiple-Clock-Domain Processors. |
75 | Distributing the Frontend for Temperature Reduction. |
67 | SENSS: Security Enhancement to Symmetric Shared Memory Multiprocessors. |
64 | On the Limits of Leakage Power Reduction in Caches. |
61 | Stretching the Limits of Clock-Gating Efficiency in Server-Class Processors. |
61 | A Small, Fast and Low-Power Register File by Bit-Partitioning. |
59 | Effective Instruction Prefetching in Chip Multiprocessors for Modern Commercial Applications. |
54 | Enterprise IT Trends and Implications for Architecture Research. |
53 | Microarchitectural Wire Management for Performance and Power in Partitioned Architectures. |
53 | An Efficient Programmable 10 Gigabit Ethernet Network Interface Card. |
47 | Scatter-Add in Data Parallel Architectures. |
43 | Heat Stroke: Power-Density-Based Denial of Service in SMT. |
34 | Exploring the Design Space of Power-Aware Opto-Electronic Networked Systems. |
28 | Multithreaded Value Prediction. |
27 | Software Directed Issue Queue Power Reduction. |
26 | Low-Overhead Interactive Debugging via Dynamic Instrumentation with DISE. |
26 | Accurate Energy Dissipation and Thermal Modeling for Nanometer-Scale Buses. |
11 | Tapping ZettaRAM™ for Low-Power Memory Systems. |
11 | Using Virtual Load/Store Queues (VLSQs) to Reduce the Negative Effects of Reordered Memory Instructions. |
0 | The Future of Computer Architecture Research: An Industrial Perspective. |