ISCA
All
Cited by | Paper title | Year |
---|---|---|
1553 | Power provisioning for a warehouse-sized computer. | 2007 |
1203 | Dark silicon and the end of multicore scaling. | 2011 |
937 | Scalable high performance main memory system using phase-change memory technology. | 2009 |
875 | Architecting phase change memory as a scalable dram alternative. | 2009 |
756 | Debunking the 100X GPU vs. CPU myth: an evaluation of throughput computing on CPU and GPU. | 2010 |
644 | A durable and energy efficient main memory using phase change memory technology. | 2009 |
625 | Corona: System Implications of Emerging Nanophotonic Technology. | 2008 |
610 | Continuous Optimization. | 2005 |
588 | 3D-Stacked Memory Architectures for Multi-core Processors. | 2008 |
547 | Adaptive insertion policies for high performance caching. | 2007 |
539 | Techniques for Multicore Thermal Management: Classification and New Exploration. | 2006 |
532 | An analytical model for a GPU architecture with memory-level and thread-level parallelism awareness. | 2009 |
511 | Interconnections in Multi-Core Architectures: Understanding Mechanisms, Overheads and Scaling. | 2005 |
497 | Virtualizing Transactional Memory. | 2005 |
477 | Cooperative Caching for Chip Multiprocessors. | 2006 |
451 | Anton, a special-purpose machine for molecular dynamics simulation. | 2007 |
448 | Parallelism-Aware Batch Scheduling: Enhancing both Performance and Fairness of Shared DRAM Systems. | 2008 |
427 | Design and Management of 3D Chip Multiprocessors Using Network-in-Memory. | 2006 |
413 | High performance cache replacement using re-reference interval prediction (RRIP). | 2010 |
411 | Victim Replication: Maximizing Capacity while Hiding Wire Delay in Tiled Chip Multiprocessors. | 2005 |
389 | An integrated GPU power and performance model. | 2010 |
380 | Reactive NUCA: near-optimal block placement and replication in distributed caches. | 2009 |
376 | Ensemble-level Power Management for Dense Blade Servers. | 2006 |
368 | Express virtual channels: towards the ideal interconnection fabric. | 2007 |
367 | Technology-Driven, Highly-Scalable Dragonfly Topology. | 2008 |
365 | An effective hybrid transactional memory system with strong isolation guarantees. | 2007 |
363 | Flattened butterfly: a cost-efficient topology for high-radix networks. | 2007 |
361 | Energy proportional datacenter networks. | 2010 |
353 | A reconfigurable fabric for accelerating large-scale datacenter services. | 2014 |
341 | BugNet: Continuously Recording Program Execution for Deterministic Replay Debugging. | 2005 |
341 | Optimizing Replication, Communication, and Capacity Allocation in CMPs. | 2005 |
337 | Firefly: illuminating future network-on-chip with nanophotonics. | 2009 |
334 | A High Throughput String Matching Architecture for Intrusion Detection and Prevention. | 2005 |
327 | Bulk Disambiguation of Speculative Threads in Multiprocessors. | 2006 |
318 | Understanding sources of inefficiency in general-purpose chips. | 2010 |
316 | A case for bufferless routing in on-chip networks. | 2009 |
306 | Hybrid cache architecture with disparate memory technologies. | 2009 |
305 | The Impact of Performance Asymmetry in Emerging Multicore Architectures. | 2005 |
295 | Power management of online data-intensive services. | 2011 |
294 | Core fusion: accommodating software diversity in chip multiprocessors. | 2007 |
292 | Variation-Aware Application Scheduling and Power Management for Chip Multiprocessors. | 2008 |
290 | Improving NAND Flash Based Disk Caches. | 2008 |
289 | A Scalable Architecture For High-Throughput Regular-Expression Pattern Matching. | 2006 |
288 | Self-Optimizing Memory Controllers: A Reinforcement Learning Approach. | 2008 |
285 | Raksha: a flexible information flow architecture for software security. | 2007 |
272 | GPUWattch: enabling energy optimizations in GPGPUs. | 2013 |
267 | A Case for MLP-Aware Cache Replacement. | 2006 |
267 | PIPP: promotion/insertion pseudo-partitioning of multi-core shared caches. | 2009 |
265 | A novel dimensionally-decomposed router for on-chip communication in 3D architectures. | 2007 |
263 | Mitigating Amdahl’s Law through EPI Throttling. | 2005 |
263 | New cache designs for thwarting software cache-based side channel attacks. | 2007 |
258 | Performance pathologies in hardware transactional memory. | 2007 |
254 | NoHype: virtualized cloud infrastructure without the virtualization. | 2010 |
253 | SODA: A Low-power Architecture For Software Radio. | 2006 |
238 | Hardware support for WCET analysis of hard real-time multicore systems. | 2009 |
236 | Scaling the bandwidth wall: challenges in and avenues for CMP scaling. | 2009 |
236 | RAIDR: Retention-aware intelligent DRAM refresh. | 2012 |
232 | Microarchitecture of a High-Radix Router. | 2005 |
232 | Use ECP, not ECC, for hard failures in resistive memories. | 2010 |
230 | Design and Implementation of the AEGIS Single-Chip Secure Processor Using Physical Random Functions. | 2005 |
229 | A Gracefully Degrading and Energy-Efficient Modular Router Architecture for On-Chip Networks. | 2006 |
227 | Exploiting Structural Duplication for Lifetime Reliability Enhancement. | 2005 |
225 | Carbon: architectural support for fine-grained parallelism on chip multiprocessors. | 2007 |
225 | Trading off Cache Capacity for Reliability to Enable Low Voltage Operation. | 2008 |
225 | Thread motion: fine-grained power management for multi-core systems. | 2009 |
223 | BulkSC: bulk enforcement of sequential consistency. | 2007 |
222 | Virtual Circuit Tree Multicasting: A Case for On-Chip Hardware Multicast Support. | 2008 |
221 | MIRA: A Multi-layered On-Chip Interconnect Router Architecture. | 2008 |
220 | The V-Way Cache: Demand Based Associativity via Global Replacement. | 2005 |
217 | Architectural Semantics for Practical Transactional Memory. | 2006 |
217 | DeLorean: Recording and Deterministically Replaying Shared-Memory Multiprocessor Execution Efficiently. | 2008 |
217 | A Comprehensive Memory Modeling Tool and Its Application to the Design and Analysis of Future Memory Hierarchies. | 2008 |
216 | Understanding and Designing New Server Architectures for Emerging Warehouse-Computing Environments. | 2008 |
215 | Computing Architectural Vulnerability Factors for Address-Based Structures. | 2005 |
211 | Architecture for Protecting Critical Secrets in Microprocessors. | 2005 |
211 | Rethinking DRAM design and organization for energy-constrained multi-cores. | 2010 |
208 | The BlackWidow High-Radix Clos Network. | 2006 |
207 | Virtual hierarchies to support server consolidation. | 2007 |
205 | Scheduling heterogeneous multi-cores through performance impact estimation (PIE). | 2012 |
195 | Near-Optimal Worst-Case Throughput Routing for Two-Dimensional Mesh Networks. | 2005 |
195 | An Ultra Low Power System Architecture for Sensor Network Applications. | 2005 |
195 | Relax: an architectural framework for software recovery of hardware faults. | 2010 |
192 | Configurable isolation: building high availability systems with commodity multi-core processors. | 2007 |
192 | Virtual private caches. | 2007 |
191 | RegionScout: Exploiting Coarse Grain Sharing in Snoop-Based Coherence. | 2005 |
190 | The performance of PC solid-state disks (SSDs) as a function of bandwidth, concurrency, device architecture, and system organization. | 2009 |
189 | Security refresh: prevent malicious wear-out and increase durability for phase-change memory with dynamically randomized address mapping. | 2010 |
186 | Dynamic warp subdivision for integrated branch and memory divergence tolerance. | 2010 |
179 | Rerun: Exploiting Episodes for Lightweight Memory Race Recording. | 2008 |
178 | An Architecture Framework for Transparent Instruction Set Customization in Embedded Processors. | 2005 |
175 | Energy-efficient mechanisms for managing thread context in throughput processors. | 2011 |
174 | Temperature-constrained power control for chip multiprocessors with online model estimation. | 2009 |
173 | Reducing cache power with low-cost, multi-bit error-correcting codes. | 2010 |
172 | Web search using mobile cores: quantifying and mitigating the price of efficiency. | 2010 |
172 | The impact of memory subsystem resource sharing on datacenter applications. | 2011 |
170 | Thread criticality predictors for dynamic performance, power, and resource management in chip multiprocessors. | 2009 |
169 | Phastlane: a rapid transit optical routing network. | 2009 |
165 | Spatial Memory Streaming. | 2006 |
162 | Making the fast case common and the uncommon case simple in unbounded transactional memory. | 2007 |
160 | Flexible Decoupled Transactional Memory Support. | 2008 |
160 | Benefits and limitations of tapping into stored energy for datacenters. | 2011 |
159 | Rigel: an architecture and scalable programming interface for a 1000-core accelerator. | 2009 |
157 | Analysis of redundancy and application balance in the SPEC CPU2006 benchmark suite. | 2007 |
157 | Vantage: scalable and efficient fine-grain cache partitioning. | 2011 |
154 | Disaggregated memory for expansion and sharing in blade servers. | 2009 |
153 | ReCycle: pipeline adaptation to tolerate process variation. | 2007 |
150 | Design and Evaluation of Hybrid Fault-Detection Systems. | 2005 |
150 | Globally-Synchronized Frames for Guaranteed Quality-of-Service in On-Chip Networks. | 2008 |
148 | Opportunistic Transient-Fault Detection. | 2005 |
148 | Direct Cache Access for High Bandwidth Network I/O. | 2005 |
145 | Staged memory scheduling: Achieving high performance and scalability in heterogeneous systems. | 2012 |
144 | Towards energy-proportional datacenter memory with mobile DRAM. | 2012 |
143 | Morphable memory system: a robust architecture for exploiting multi-level phase change memories. | 2010 |
143 | ZSim: fast and accurate microarchitectural simulation of thousand-core systems. | 2013 |
141 | Resistive computation: avoiding the power wall with low-leakage, STT-MRAM based computing. | 2010 |
140 | A case for exploiting subarray-level parallelism (SALP) in DRAM. | 2012 |
139 | Improving Cost, Performance, and Security of Memory Encryption and Authentication. | 2006 |
139 | Aérgia: exploiting packet latency slack in on-chip networks. | 2010 |
138 | An integrated hardware-software approach to flexible transactional memory. | 2007 |
137 | Limiting the power consumption of main memory. | 2007 |
137 | Achieving predictable performance through better memory controller placement in many-core CMPs. | 2009 |
137 | A case for an interleaving constrained shared-memory multi-processor. | 2009 |
136 | Scale-out processors. | 2012 |
134 | TokenTM: Efficient Execution of Large Transactions with Hardware Transactional Memory. | 2008 |
134 | Kilo-NOC: a heterogeneous network-on-chip architecture for scalability and service guarantees. | 2011 |
132 | Increasing the effectiveness of directory caches by deactivating coherence for private memory blocks. | 2011 |
131 | Interconnect-Aware Coherence Protocols for Chip Multiprocessors. | 2006 |
131 | Flipping bits in memory without accessing them: An experimental study of DRAM disturbance errors. | 2014 |
130 | Flexible Hardware Acceleration for Instruction-Grain Program Monitoring. | 2008 |
127 | Comparing memory systems for chip multiprocessors. | 2007 |
125 | A Robust Main-Memory Compression Scheme. | 2005 |
125 | Dynamic prediction of architectural vulnerability from microarchitectural state. | 2007 |
124 | Improving Multiprocessor Performance with Coarse-Grain Coherence Tracking. | 2005 |
124 | Architectural core salvaging in a multi-core processor for hard-error tolerance. | 2009 |
124 | A dynamically configurable coprocessor for convolutional neural networks. | 2010 |
124 | Energy-performance tradeoffs in processor architecture and circuit design: a marginal cost analysis. | 2010 |
123 | Bubble-flux: precise online QoS management for increased utilization in warehouse scale computers. | 2013 |
122 | Temporal Streaming of Shared Memory. | 2005 |
121 | Thin servers with smart pipes: designing SoC accelerators for memcached. | 2013 |
120 | Learning-Based SMT Processor Resource Distribution via Hill-Climbing. | 2006 |
120 | TRAP-Array: A Disk Array Architecture Providing Timely Recovery to Any Point-in-time. | 2006 |
120 | FabScalar: composing synthesizable RTL designs of arbitrary cores within a canonical superscalar template. | 2011 |
119 | An experimental study of data retention behavior in modern DRAM devices: implications for retention time profiling mechanisms. | 2013 |
118 | Analysis of the O-GEometric History Length Branch Predictor. | 2005 |
118 | Atom-Aid: Detecting and Surviving Atomicity Violations. | 2008 |
118 | Managing distributed UPS energy for effective power capping in data centers. | 2012 |
114 | DBAR: an efficient routing algorithm to support multiple concurrent applications in networks-on-chip. | 2011 |
113 | Using Hardware Memory Protection to Build a High-Performance, Strongly-Atomic Hybrid Transactional Memory. | 2008 |
113 | VEAL: Virtualized Execution Accelerator for Loops. | 2008 |
111 | Energy-efficient cache design using variable-strength error-correcting codes. | 2011 |
110 | Translation caching: skip, don’t walk (the page table). | 2010 |
109 | Energy Optimization of Subthreshold-Voltage Sensor Network Processors. | 2005 |
109 | Interconnect design considerations for large NUCA caches. | 2007 |
107 | Efficient virtual memory for big memory servers. | 2013 |
106 | Memory mapped ECC: low-cost error protection for last level caches. | 2009 |
106 | PreSET: Improving performance of phase change memories by exploiting asymmetry in write times. | 2012 |
105 | Convolution engine: balancing efficiency & flexibility in specialized computing. | 2013 |
103 | Examining ACE analysis reliability estimates using fault-injection. | 2007 |
101 | Adaptive Mechanisms and Policies for Managing Cache Hierarchies in Chip Multiprocessors. | 2005 |
101 | Towards energy proportionality for large-scale latency-critical workloads. | 2014 |
100 | SigRace: signature-based data race detection. | 2009 |
100 | Re-architecting DRAM memory systems with monolithically integrated silicon photonics. | 2010 |
99 | High Efficiency Counter Mode Security Architecture via Prediction and Precomputation. | 2005 |
99 | Mechanisms for store-wait-free multiprocessors. | 2007 |
99 | The impact of management operations on the virtualized datacenter. | 2010 |
99 | SieveStore: a highly-selective, ensemble-level disk cache for cost-performance. | 2010 |
98 | A Tree Based Router Search Engine Architecture with Single Port Memories. | 2005 |
98 | Scalable power control for many-core architectures running multi-threaded applications. | 2011 |
98 | Prefetch-aware shared resource management for multi-core systems. | 2011 |
98 | A scalable processing-in-memory accelerator for parallel graph processing. | 2015 |
97 | Exploring the tradeoffs between programmability and efficiency in data-parallel accelerators. | 2011 |
96 | MetaTM/TxLinux: transactional memory for an operating system. | 2007 |
96 | Spatio-temporal memory streaming. | 2009 |
96 | Orchestrated scheduling and prefetching for GPGPUs. | 2013 |
95 | Piecewise Linear Branch Prediction. | 2005 |
95 | Robust architectural support for transactional memory in the power architecture. | 2013 |
95 | General-purpose code acceleration with limited-precision analog computation. | 2014 |
94 | AnySP: anytime anywhere anyway signal processing. | 2009 |
94 | Modeling critical sections in Amdahl’s law and its implications for multicore design. | 2010 |
94 | Die-stacked DRAM caches for servers: hit ratio, latency, or bandwidth? have it all with footprint cache. | 2013 |
91 | A Proactive Wearout Recovery Approach for Exploiting Microarchitectural Redundancy to Extend Cache SRAM Lifetime. | 2008 |
90 | Synchronization state buffer: supporting efficient fine-grain synchronization on many-core architectures. | 2007 |
90 | InvisiFence: performance-transparent memory ordering in conventional multiprocessors. | 2009 |
89 | Conflict exceptions: simplifying concurrent language semantics with precise hardware exceptions for data-races. | 2010 |
88 | Rotary router: an efficient architecture for CMP interconnection networks. | 2007 |
88 | The virtual write queue: coordinating DRAM and last-level cache policies. | 2010 |
88 | A case for heterogeneous on-chip interconnects for CMPs. | 2011 |
88 | Memory persistency. | 2014 |
86 | Silicon-photonic network architectures for scalable, power-efficient multi-chip systems. | 2010 |
84 | Evolution of thread-level parallelism in desktop applications. | 2010 |
84 | Catnap: energy proportional multiple network-on-chip. | 2013 |
83 | On the feasibility of online malware detection with performance counters. | 2013 |
82 | Disk Drive Roadmap from the Thermal Perspective: A Case for Dynamic Thermal Management. | 2005 |
82 | Hardware atomicity for reliable software speculation. | 2007 |
82 | A defect-tolerant accelerator for emerging high-performance applications. | 2012 |
80 | Store Vulnerability Window (SVW): Re-Execution Filtering for Enhanced Load Optimization. | 2005 |
80 | iSwitch: Coordinating and optimizing renewable energy powered server clusters. | 2012 |
80 | EIE: Efficient Inference Engine on Compressed Deep Neural Network. | 2016 |
79 | An abacus turn model for time/space-efficient reconfigurable routing. | 2011 |
78 | ReVIVaL: A Variation-Tolerant Architecture Using Voltage Interpolation and Variable Latency. | 2008 |
77 | An intra-chip free-space optical interconnect. | 2010 |
76 | Rescue: A Microarchitecture for Testability and Defect Tolerance. | 2005 |
76 | Power model validation through thermal measurements. | 2007 |
76 | ShiDianNao: shifting vision processing closer to the sensor. | 2015 |
75 | Online Estimation of Architectural Vulnerability Factor for Soft Errors. | 2008 |
75 | Can traditional programming bridge the Ninja performance gap for parallel computing applications? | 2012 |
74 | Application-aware deadlock-free oblivious routing. | 2009 |
74 | Architecting on-chip interconnects for stacked 3D STT-RAM caches in CMPs. | 2011 |
74 | Bypass and insertion algorithms for exclusive last-level caches. | 2011 |
74 | Simultaneous branch and warp interweaving for sustained GPU performance. | 2012 |
74 | Aladdin: A pre-RTL, power-performance accelerator simulator enabling large design space exploration of customized architectures. | 2014 |
73 | Thread tailor: dynamically weaving threads together for efficient, adaptive parallel applications. | 2010 |
73 | TimeWarp: Rethinking timekeeping and performance monitoring mechanisms to mitigate side-channel attacks. | 2012 |
73 | The Yin and Yang of power and performance for asymmetric hardware and managed software. | 2012 |
72 | A 64-bit stream processor architecture for scientific applications. | 2007 |
71 | iDEAL: Inter-router Dual-Function Energy and Area-Efficient Links for Network-on-Chip (NoC) Architectures. | 2008 |
70 | Scalable Load and Store Processing in Latency Tolerant Processors. | 2005 |
70 | Chisel: A Storage-efficient, Collision-free Hash-based Network Processing Architecture. | 2006 |
70 | Decoupled DIMM: building high-bandwidth memory system using low-speed DRAM devices. | 2009 |
69 | Techniques for Efficient Processing in Runahead Execution Engines. | 2005 |
69 | Automated design of application specific superscalar processors: an analytical approach. | 2007 |
69 | Polymorphic On-Chip Networks. | 2008 |
69 | Crafting a usable microkernel, processor, and I/O system with strict and provable information flow security. | 2011 |
68 | Program Demultiplexing: Data-flow based Speculative Parallelization of Methods in Sequential Programs. | 2006 |
68 | Elastic cooperative caching: an autonomous dynamically adaptive memory hierarchy for chip multiprocessors. | 2010 |
66 | Indirect adaptive routing on large scale interconnection networks. | 2009 |
66 | Adaptive granularity memory systems: a tradeoff between storage efficiency and throughput. | 2011 |
66 | The role of optics in future high radix switch design. | 2011 |
66 | Navigating big data with high-throughput, energy-efficient data partitioning. | 2013 |
65 | A case for FAME: FPGA architecture model execution. | 2010 |
65 | Combining memory and a controller with photonics through 3D-stacking to enable scalable and energy-efficient systems. | 2011 |
65 | End-to-end sequential consistency. | 2012 |
64 | An Integrated Memory Array Processor Architecture for Embedded Image Recognition Systems. | 2005 |
64 | Mechanisms for bounding vulnerabilities of processor structures. | 2007 |
64 | A case for random shortcut topologies for HPC interconnects. | 2012 |
64 | Whare-map: heterogeneity in "homogeneous" warehouse-scale computers. | 2013 |
64 | Design space exploration and optimization of path oblivious RAM in secure processors. | 2013 |
62 | Balanced Cache: Reducing Conflict Misses of Direct-Mapped Caches. | 2006 |
62 | Internet-scale service infrastructure efficiency. | 2009 |
61 | LOT-ECC: Localized and tiered reliability mechanisms for commodity memory systems. | 2012 |
60 | An Integrated Framework for Dependable and Revivable Architectures Using Multicore Processors. | 2006 |
60 | Simultaneous speculative threading: a novel pipeline architecture implemented in sun’s rock processor. | 2009 |
60 | ColorSafe: architectural support for debugging and dynamically avoiding multi-variable atomicity violations. | 2010 |
60 | SRAM-DRAM hybrid memory with applications to efficient register files in fine-grained multi-threading. | 2011 |
60 | ArchShield: architectural framework for assisting DRAM scaling by tolerating high error rates. | 2013 |
60 | A hardware evaluation of cache partitioning to improve utilization and energy-efficiency while preserving responsiveness. | 2013 |
59 | Deconstructing Commodity Storage Clusters. | 2005 |
59 | Flexible Snooping: Adaptive Forwarding and Filtering of Snoops in Embedded-Ring Multiprocessors. | 2006 |
59 | Rapid identification of architectural bottlenecks via precise event counting. | 2011 |
59 | Probabilistic Shared Cache Management (PriSM). | 2012 |
58 | Branch regulation: Low-overhead protection from code reuse attacks. | 2012 |
58 | Triggered instructions: a control paradigm for spatially-programmed architectures. | 2013 |
58 | The CHERI capability model: Revisiting RISC in an age of risk. | 2014 |
57 | Thermal modeling and management of DRAM memory systems. | 2007 |
57 | Heracles: improving resource efficiency at scale. | 2015 |
57 | PIM-enabled instructions: a low-overhead, locality-aware processing-in-memory architecture. | 2015 |
56 | Side-channel vulnerability factor: A metric for measuring information leakage. | 2012 |
55 | Cohesion: a hybrid memory model for accelerators. | 2010 |
54 | Memory Model = Instruction Reordering + Store Atomicity. | 2006 |
54 | LINQits: big data on little clients. | 2013 |
53 | An Evaluation Framework and Instruction Set Architecture for Ion-Trap Based Quantum Micro-Architectures. | 2005 |
53 | Profiling a warehouse-scale computer. | 2015 |
52 | WiDGET: Wisconsin decoupled grid execution tiles. | 2010 |
52 | SpecTLB: a mechanism for speculative address translation. | 2011 |
52 | Reducing memory reference energy with opportunistic virtual caching. | 2012 |
52 | SurfNoC: a low latency and provably non-interfering approach to secure networks-on-chip. | 2013 |
51 | Quantum Memory Hierarchies: Efficient Designs to Match Available Parallelism in Quantum Computing. | 2006 |
51 | Sampling + DMR: practical and low-overhead permanent fault detection. | 2011 |
51 | Tri-level-cell phase change memory: toward an efficient and reliable memory system. | 2013 |
50 | Reducing memory access latency with asymmetric DRAM bank organizations. | 2013 |
50 | Enabling preemptive multiprogramming on GPUs. | 2014 |
49 | CAPRI: Prediction of compaction-adequacy for handling control-divergence in GPGPU architectures. | 2012 |
49 | Utility-based acceleration of multithreaded applications on asymmetric CMPs. | 2013 |
48 | Physical simulation for animation and visual effects: parallelization and characterization for chip multiprocessors. | 2007 |
48 | Continuous real-world inputs can open up alternative accelerator designs. | 2013 |
47 | ParallAX: an architecture for real-time physics. | 2007 |
47 | i-NVMM: a secure non-volatile main memory system with incremental encryption. | 2011 |
47 | The dynamic granularity memory system. | 2012 |
46 | RENO - A Rename-Based Instruction Optimizer. | 2005 |
46 | Stream chaining: exploiting multiple levels of correlation in data prefetching. | 2009 |
46 | Physically Addressed Queueing (PAQ): Improving parallelism in Solid State Disks. | 2012 |
45 | Late-binding: enabling unordered load-store queues. | 2007 |
45 | Learning and Leveraging the Relationship between Architecture-Level Measurements and Individual User Satisfaction. | 2008 |
45 | iGPU: Exception support and speculative execution on GPUs. | 2012 |
44 | Store Buffer Design in First-Level Multibanked Data Caches. | 2005 |
44 | Multiple Instruction Stream Processor. | 2006 |
44 | Multi-execution: multicore caching for data-similar executions. | 2009 |
43 | VPC prediction: reducing the cost of indirect branches via hardware-based dynamic devirtualization. | 2007 |
42 | Improving Program Efficiency by Packing Instructions into Registers. | 2005 |
42 | Area-Performance Trade-offs in Tiled Dataflow Architectures. | 2006 |
42 | Using hardware vulnerability factors to enhance AVF analysis. | 2010 |
42 | RADISH: Always-on sound and complete race detection in software and hardware. | 2012 |
42 | Understanding and mitigating refresh overheads in high-density DDR4 DRAM systems. | 2013 |
41 | Performance and power of cache-based reconfigurable computing. | 2009 |
41 | A new perspective for efficient virtual-cache coherence. | 2013 |
41 | Flicker: a dynamically adaptive architecture for power limited multicore systems. | 2013 |
41 | An energy-efficient and scalable eDRAM-based register file architecture for GPGPU. | 2013 |
40 | Achieving Out-of-Order Performance with Almost In-Order Complexity. | 2008 |
40 | A fault tolerant, area efficient architecture for Shor’s factoring algorithm. | 2009 |
40 | Dynamic performance tuning for speculative threads. | 2009 |
40 | BOOM: Enabling mobile memory based low-power server DIMMs. | 2012 |
39 | Reducing Startup Time in Co-Designed Virtual Machines. | 2006 |
39 | Matrix scheduler reloaded. | 2007 |
39 | From Speculation to Security: Practical and Efficient Information Flow Tracking Using Speculative Hardware. | 2008 |
39 | Forwardflow: a scalable core for power-constrained CMPs. | 2010 |
39 | RETCON: transactional repair without replay. | 2010 |
39 | CPPC: correctable parity protected cache. | 2011 |
39 | AC-DIMM: associative computing with STT-MRAM. | 2013 |
39 | SCORPIO: A 36-core research chip demonstrating snoopy coherence on a scalable mesh NoC with in-network ordering. | 2014 |
38 | Atomic Vector Operations on Chip Multiprocessors. | 2008 |
38 | LReplay: a pending period based deterministic replay scheme. | 2010 |
38 | Automatic abstraction and fault tolerance in cortical microarchitectures. | 2011 |
38 | Criticality stacks: identifying critical threads in parallel programs using synchronization behavior. | 2013 |
37 | Software-Controlled Priority Characterization of POWER5 Processor. | 2008 |
37 | SC2: A statistical compression cache scheme. | 2014 |
36 | Transparent control independence (TCI). | 2007 |
36 | ECMon: exposing cache events for monitoring. | 2009 |
36 | TLSync: support for multiple fast barriers using on-chip transmission lines. | 2011 |
36 | Fighting fire with fire: modeling the datacenter-scale effects of targeted superlattice thermal management. | 2011 |
36 | Buffer-on-board memory systems. | 2012 |
36 | WebCore: Architectural support for mobile Web browsing. | 2014 |
36 | SynFull: Synthetic traffic models capturing cache coherent behaviour. | 2014 |
35 | Dynamic Verification of Sequential Consistency. | 2005 |
35 | Running a Quantum Circuit at the Speed of Data. | 2008 |
35 | Watchdog: Hardware for safe and secure manual memory management and full memory safety. | 2012 |
35 | Improving memory scheduling via processor-side load criticality information. | 2013 |
35 | Harnessing ISA diversity: Design of a heterogeneous-ISA chip multiprocessor. | 2014 |
34 | Virtualizing performance asymmetric multi-core systems. | 2011 |
33 | Tolerating Dependences Between Large Speculative Threads Via Sub-Threads. | 2006 |
33 | Intra-disk Parallelism: An Idea Whose Time Has Come. | 2008 |
33 | Demand-driven software race detection using hardware performance counters. | 2011 |
33 | Exploring memory consistency for massively-threaded throughput-oriented processors. | 2013 |
33 | STAG: Spintronic-Tape Architecture for GPGPU cache hierarchies. | 2014 |
33 | Half-DRAM: A high-bandwidth and low-power DRAM architecture from the rethinking of fine-grained activation. | 2014 |
32 | Tolerating process variations in nanophotonic on-chip networks. | 2012 |
32 | FLEXclusion: Balancing cache capacity and on-chip bandwidth via Flexible Exclusion. | 2012 |
32 | The locality-aware adaptive cache coherence protocol. | 2013 |
32 | Resilient die-stacked DRAM caches. | 2013 |
32 | Data reorganization in memory using 3D-stacked DRAM. | 2015 |
32 | ISAAC: A Convolutional Neural Network Accelerator with In-Situ Analog Arithmetic in Crossbars. | 2016 |
32 | Minerva: Enabling Low-Power, Highly-Accurate Deep Neural Network Accelerators. | 2016 |
31 | Data marshaling for multi-core architectures. | 2010 |
31 | A case for globally shared-medium on-chip interconnect. | 2011 |
31 | Zombie memory: extending memory lifetime by reviving dead blocks. | 2013 |
31 | Rumba: an online quality management system for approximate computing. | 2015 |
31 | PRIME: A Novel Processing-in-Memory Architecture for Neural Network Computation in ReRAM-Based Main Memory. | 2016 |
30 | Conditional Memory Ordering. | 2006 |
30 | Interconnection Networks for Scalable Quantum Computers. | 2006 |
30 | Cooperative boosting: needy versus greedy power management. | 2013 |
29 | Aquacore: a programmable architecture for microfluidics. | 2007 |
29 | Boosting single-thread performance in multi-core systems through fine-grain multi-threading. | 2009 |
29 | Timetraveler: exploiting acyclic races for optimizing memory race recording. | 2010 |
29 | Harmony: Collection and analysis of parallel block vectors. | 2012 |
29 | PARDIS: A programmable memory controller for the DDRx interfacing standards. | 2012 |
29 | Virtualizing power distribution in datacenters. | 2013 |
29 | The Dirty-Block Index. | 2014 |
29 | Architecting to achieve a billion requests per second throughput on a single key-value store server platform. | 2015 |
27 | Revisiting hardware-assisted page walks for virtualized systems. | 2012 |
27 | A first-order mechanistic model for architectural vulnerability factor. | 2012 |
27 | A micro-architectural analysis of switched photonic multi-chip interconnects. | 2012 |
27 | Agile, efficient virtualization power management with low-latency server power states. | 2013 |
27 | Redundant memory mappings for fast access to large memories. | 2015 |
27 | DjiNN and Tonic: DNN as a service and its implications for future warehouse scale computers. | 2015 |
26 | DNA-based molecular architecture with spatially localized components. | 2013 |
26 | Unifying on-chip and inter-node switching within the Anton 2 network. | 2014 |
26 | BlueDBM: an appliance for big data analytics. | 2015 |
25 | Ginger: control independence using tag rewriting. | 2007 |
25 | Flexible reference-counting-based hardware acceleration for garbage collection. | 2009 |
25 | A memory system design framework: creating smart memories. | 2009 |
25 | Rebound: scalable checkpointing for coherent shared memory. | 2011 |
25 | VRSync: Characterizing and eliminating synchronization-induced voltage emergencies in many-core processors. | 2012 |
25 | Protozoa: adaptive granularity cache coherence. | 2013 |
25 | QuickSAN: a storage area network for fast, distributed, solid state disks. | 2013 |
25 | Architecture implications of pads as a scarce resource. | 2014 |
24 | Slackened Memory Dependence Enforcement: Combining Opportunistic Forwarding with Decoupled Verification. | 2006 |
24 | Boosting mobile GPU performance with a decoupled access/execute fragment processor. | 2012 |
24 | Studying multicore processor scaling via reuse distance analysis. | 2013 |
24 | Quantitative comparison of hardware transactional memory for Blue Gene/Q, zEnterprise EC12, Intel Core, and POWER8. | 2015 |
24 | Warped-compression: enabling power efficient GPUs through register compression. | 2015 |
24 | Eyeriss: A Spatial Architecture for Energy-Efficient Dataflow for Convolutional Neural Networks. | 2016 |
23 | Improving writeback efficiency with decoupled last-write prediction. | 2012 |
23 | Lane decoupling for improving the timing-error resiliency of wide-SIMD architectures. | 2012 |
23 | SIMD divergence optimization through intra-warp compaction. | 2013 |
23 | Maximizing SIMD resource utilization in GPGPUs with SIMD lane permutation. | 2013 |
22 | Bit mapping for balanced PCM cell programming. | 2013 |
22 | Dynamic reduction of voltage margins by leveraging on-chip ECC in Itanium II processors. | 2013 |
22 | Eliminating redundant fragment shader executions on a mobile GPU via hardware memoization. | 2014 |
22 | A case for core-assisted bottleneck acceleration in GPUs: enabling flexible data compression with assist warps. | 2015 |
21 | Distributed Arithmetic on a Quantum Multicomputer. | 2006 |
21 | Dynamic MIPS rate stabilization in out-of-order processors. | 2009 |
21 | Moguls: a model to explore the memory hierarchy for bandwidth improvements. | 2011 |
21 | WeeFence: toward making fences free in TSO. | 2013 |
21 | Going vertical in memory management: Handling multiplicity by multi-policy. | 2014 |
21 | SleepScale: Runtime joint speed scaling and sleep states management for power efficient data centers. | 2014 |
21 | Dynamic thread block launch: a lightweight execution mechanism to support irregular applications on GPUs. | 2015 |
21 | CAWA: coordinated warp scheduling and cache prioritization for critical warp acceleration of GPGPU workloads. | 2015 |
21 | A fully associative, tagless DRAM cache. | 2015 |
20 | Leveraging the core-level complementary effects of PVT variations to reduce timing emergencies in multi-core processors. | 2010 |
20 | Flexible auto-refresh: enabling scalable and energy-efficient DRAM refresh reductions. | 2015 |
20 | BEAR: techniques for mitigating bandwidth bloat in gigascale DRAM caches. | 2015 |
19 | End-to-end register data-flow continuous self-test. | 2009 |
19 | OUTRIDER: efficient memory latency tolerance with decoupled strands. | 2011 |
19 | Inspection resistant memory: Architectural support for security from physical examination. | 2012 |
19 | QuickRec: prototyping an intel architecture extension for record and replay of multithreaded programs. | 2013 |
19 | Single-graph multiple flows: Energy efficient design alternative for GPGPUs. | 2014 |
19 | Exploring the potential of heterogeneous von neumann/dataflow execution models. | 2015 |
18 | HELIX-RC: An architecture-compiler co-design for automatic parallelization of irregular programs. | 2014 |
18 | Stash: have your scratchpad and cache it too. | 2015 |
18 | Cnvlutin: Ineffectual-Neuron-Free Deep Neural Network Computing. | 2016 |
17 | The Future of Virtualization Technology. | 2006 |
17 | Necromancer: enhancing system throughput by animating dead cores. | 2010 |
17 | Optimizing virtual machine consolidation performance on NUMA server architecture for cloud workloads. | 2014 |
17 | HIOS: A host interface I/O scheduler for Solid State Disks. | 2014 |
17 | Transparent Offloading and Mapping (TOM): Enabling Programmer-Transparent Near-Data Processing in GPU Systems. | 2016 |
16 | CPU transparent protection of OS kernel and hypervisor integrity with programmable DRAM. | 2013 |
16 | Towards sustainable in-situ server systems in the big data era. | 2015 |
16 | HEB: deploying and managing hybrid energy buffers for improving datacenter efficiency and economy. | 2015 |
15 | Counting Dependence Predictors. | 2008 |
15 | Microcoded Architectures for Ion-Tap Quantum Computers. | 2008 |
15 | Sentry: light-weight auxiliary memory access control. | 2010 |
15 | CODOMs: Protecting software with Code-centric memory Domains. | 2014 |
15 | Real-world design and evaluation of compiler-managed GPU redundant multithreading. | 2014 |
15 | EOLE: Paving the way for an effective implementation of value prediction. | 2014 |
15 | Multiple clone row DRAM: a low latency and area optimized DRAM. | 2015 |
14 | Ten ways to waste a parallel computer. | 2009 |
14 | Viper: Virtual pipelines for enhanced reliability. | 2012 |
14 | Enhancing effective throughput for transmission line-based bus. | 2012 |
14 | STREX: boosting instruction cache reuse in OLTP workloads through stratified transaction execution. | 2013 |
14 | Secure I/O device sharing among virtual machines on multiple hosts. | 2013 |
14 | Page overlays: an enhanced virtual memory framework to enable fine-grained memory management. | 2015 |
14 | Hi-fi playback: tolerating position errors in shift operations of racetrack memory. | 2015 |
13 | Performance and security lessons learned from virtualizing the alpha processor. | 2007 |
13 | The rebirth of neural networks. | 2010 |
13 | CRIB: consolidated rename, issue, and bypass. | 2011 |
13 | ArchRanker: A ranking approach to design space exploration. | 2014 |
13 | Fine-grain task aggregation and coordination on GPUs. | 2014 |
13 | GangES: Gang error simulation for hardware resiliency evaluation. | 2014 |
13 | Manycore network interfaces for in-memory rack-scale computing. | 2015 |
13 | Callback: efficient synchronization without invalidation with a directory just for spin-waiting. | 2015 |
13 | ArMOR: defending against memory consistency model mismatches in heterogeneous architectures. | 2015 |
13 | Accelerating Dependent Cache Misses with an Enhanced Memory Controller. | 2016 |
13 | Neurocube: A Programmable Digital Neuromorphic Architecture with High-Density 3D Memory. | 2016 |
12 | Fusion: design tradeoffs in coherent cache hierarchies for accelerators. | 2015 |
12 | SLIP: reducing wire energy in the memory hierarchy. | 2015 |
12 | Cambricon: An Instruction Set Architecture for Neural Networks. | 2016 |
11 | Tailoring quantum architectures to implementation style: a quantum computer for mobile and persistent qubits. | 2007 |
11 | End-to-end performance forecasting: finding bottlenecks before they happen. | 2009 |
11 | Microarchitectural mechanisms to exploit value structure in SIMT architectures. | 2013 |
11 | OmniOrder: Directory-based conflict serialization of transactions. | 2014 |
11 | Harmonia: balancing compute and memory power in high-performance GPUs. | 2015 |
11 | RedEye: Analog ConvNet Image Sensor Architecture for Continuous Mobile Vision. | 2016 |
10 | Architectural implications of brick and mortar silicon manufacturing. | 2007 |
10 | Decoupled store completion/silent deterministic replay: enabling scalable data memory for CPR/CFP processors. | 2009 |
10 | Improving virtualization in the presence of software managed translation lookaside buffers. | 2013 |
10 | Increasing off-chip bandwidth in multi-core processors with switchable pins. | 2014 |
10 | Race Logic: A hardware acceleration for dynamic programming algorithms. | 2014 |
10 | Flexible software profiling of GPU architectures. | 2015 |
9 | Increased Scalability and Power Efficiency by Using Multiple Speed Pipelines. | 2005 |
9 | A Two-Level Load/Store Queue Based on Execution Locality. | 2008 |
9 | Replay debugging: Leveraging record and replay for program debugging. | 2014 |
9 | Navigating the cache hierarchy with a single lookup. | 2014 |
9 | An examination of the architecture and system-level tradeoffs of employing steep slope devices in 3D CMPs. | 2014 |
9 | Avoiding core’s DUE & SDC via acoustic wave detectors and tailored error containment and recovery. | 2014 |
9 | Thermal time shifting: leveraging phase change materials to reduce cooling costs in warehouse-scale computers. | 2015 |
9 | Probable cause: the deanonymizing effects of approximate DRAM. | 2015 |
9 | COP: to compress and protect main memory. | 2015 |
8 | Setting an error detection infrastructure with low cost acoustic wave detectors. | 2012 |
8 | A low power and reliable charge pump design for Phase Change Memories. | 2014 |
8 | Improving the energy efficiency of Big Cores. | 2014 |
8 | Row-buffer decoupling: A case for low-latency DRAM microarchitecture. | 2014 |
8 | Reducing access latency of MLC PCMs through line striping. | 2014 |
8 | DynaSpAM: dynamic spatial architecture mapping using out of order instruction schedules. | 2015 |
8 | PrORAM: dynamic prefetcher for oblivious RAM. | 2015 |
8 | Computer performance microscopy with Shim. | 2015 |
8 | CloudMonatt: an architecture for security health monitoring and attestation of virtual machines in cloud computing. | 2015 |
8 | LaPerm: Locality Aware Scheduler for Dynamic Parallelism on GPUs. | 2016 |
7 | Energy-Effectiveness of Pre-Execution and Energy-Aware P-Thread Selection. | 2005 |
7 | Moving the needle, computer architecture research in academe and industry. | 2010 |
7 | FlexBulk: intelligently forming atomic blocks in blocked-execution multiprocessors to minimize squashes. | 2011 |
7 | Non-race concurrency bug detection through order-sensitive critical sections. | 2013 |
7 | FASE: finding amplitude-modulated side-channel emanations. | 2015 |
7 | The load slice core microarchitecture. | 2015 |
6 | The End of Scaling? Revolutions in Technology and Microarchitecture as We Pass the 90 Nanometer Node. | 2006 |
6 | Fetch-Criticality Reduction through Control Independence. | 2008 |
6 | Accelerating asynchronous programs through event sneak peek. | 2015 |
6 | Reducing world switches in virtualized environment with flexible cross-world calls. | 2015 |
6 | Semantic locality and context-based prefetching using reinforcement learning. | 2015 |
6 | Efficient execution of memory access phases using dataflow specialization. | 2015 |
6 | Clean: a race detector with cleaner semantics. | 2015 |
6 | A variable warp size architecture. | 2015 |
6 | Coherence protocol for transparent management of scratchpad memories in shared memory manycore architectures. | 2015 |
6 | Dynamo: Facebook’s Data Center-Wide Power Management System. | 2016 |
6 | Warped-Slicer: Efficient Intra-SM Slicing through Dynamic Resource Partitioning for GPU Multiprogramming. | 2016 |
6 | Agile Paging: Exceeding the Best of Nested and Shadow Paging. | 2016 |
5 | Improving the future by examining the past. | 2010 |
5 | Euripus: A flexible unified hardware memory checkpointing accelerator for bidirectional-debugging and reliability. | 2012 |
5 | Configurable fine-grain protection for multicore processor virtualization. | 2012 |
5 | Quantum rotations: a case study in static and dynamic machine-code generation for quantum computers. | 2013 |
5 | Unified address translation for memory-mapped SSDs with FlashMap. | 2015 |
5 | Efficient Synonym Filtering and Scalable Delayed Translation for Hybrid Virtual Caching. | 2016 |
5 | Automatic Generation of Efficient Accelerators for Reconfigurable Hardware. | 2016 |
5 | Energy Efficient Architecture for Graph Analytics Accelerators. | 2016 |
5 | MITTS: Memory Inter-arrival Time Traffic Shaping. | 2016 |
5 | Biscuit: A Framework for Near-Data Processing of Big Data Workloads. | 2016 |
5 | CASH: Supporting IaaS Customers with a Sub-core Configurable Architecture. | 2016 |
4 | IVEC: off-chip memory integrity protection for both security and reliability. | 2010 |
4 | MemGuard: A low cost and energy efficient design to support and enhance memory system reliability. | 2014 |
4 | Fractal++: Closing the performance gap between fractal and conventional coherence. | 2014 |
4 | Branch vanguard: decomposing branch functionality into prediction and resolution instructions. | 2015 |
4 | SHRINK: reducing the ISA complexity via instruction recycling. | 2015 |
4 | MiSAR: minimalistic synchronization accelerator with resource overflow management. | 2015 |
4 | Back to the Future: Leveraging Belady’s Algorithm for Improved Cache Replacement. | 2016 |
4 | Mellow Writes: Extending Lifetime in Resistive Memories through Selective Slow Write Backs. | 2016 |
4 | Using Multiple Input, Multiple Output Formal Control to Maximize Resource Efficiency in Architectures. | 2016 |
4 | ASIC Clouds: Specializing the Datacenter. | 2016 |
4 | Virtual Thread: Maximizing Thread-Level Parallelism beyond GPU Scheduling Limit. | 2016 |
3 | Releasing efficient beta cores to market early. | 2011 |
3 | BlockChop: Dynamic squash elimination for hybrid processor architecture. | 2012 |
3 | Pacifier: Record and replay for relaxed-consistency multiprocessors with distributed directory protocol. | 2014 |
3 | MBus: an ultra-low power interconnect bus for next generation nanopower systems. | 2015 |
3 | Cost-effective speculative scheduling in high performance processors. | 2015 |
3 | Treadmill: Attributing the Source of Tail Latency through Precise Load Testing and Statistical Inference. | 2016 |
3 | Peak Efficiency Aware Scheduling for Highly Energy Proportional Servers. | 2016 |
2 | Shared caches in multicores: the good, the bad, and the ugly. | 2010 |
2 | Deconfigurable microprocessor architectures for silicon debug acceleration. | 2013 |
2 | VIP: virtualizing IP chains on handheld platforms. | 2015 |
2 | Bit-Plane Compression: Transforming Data for Better Compression in Many-Core Architectures. | 2016 |
2 | ARM Virtualization: Performance and Architectural Implications. | 2016 |
2 | Exploiting Dynamic Timing Slack for Energy Efficiency in Ultra-Low-Power Embedded Systems. | 2016 |
2 | PowerChop: Identifying and Managing Non-critical Units in Hybrid Processor Architectures. | 2016 |
2 | Future Vector Microprocessor Extensions for Data Aggregations. | 2016 |
2 | LAP: Loop-Block Aware Inclusion Properties for Energy-Efficient Asymmetric Last Level Caches. | 2016 |
2 | Morpheus: Creating Application Objects Efficiently for Heterogeneous Computing. | 2016 |
2 | Towards Statistical Guarantees in Controlling Quality Tradeoffs for Approximate Acceleration. | 2016 |
2 | ActivePointers: A Case for Software Address Translation on GPUs. | 2016 |
2 | The Anytime Automaton. | 2016 |
1 | Computer Architecture Research and Future Microprocessors: Where Do We Go from Here? | 2006 |
1 | Efficient digital neurons for large scale cortical architectures. | 2014 |
1 | FaultHound: value-locality-based soft-fault tolerance. | 2015 |
1 | Accelerating Markov Random Field Inference Using Molecular Optical Gibbs Sampling Units. | 2016 |
1 | Strober: Fast and Accurate Sample-Based Energy Simulation for Arbitrary RTL. | 2016 |
1 | Decoupling Loads for Nano-Instruction Set Computers. | 2016 |
1 | Efficiently Scaling Out-of-Order Cores for Simultaneous Multithreading. | 2016 |
1 | Energy Efficient Data Encoding in DRAM Channels Exploiting Data Value Similarity. | 2016 |
1 | Boosting Access Parallelism to PCM-Based Main Memory. | 2016 |
1 | Power Attack Defense: Securing Battery-Backed Data Centers. | 2016 |
1 | Short-Circuit Dispatch: Accelerating Virtual Machine Interpreters on Embedded Processors. | 2016 |
1 | APRES: Improving Cache Efficiency by Exploiting Load Characteristics on GPUs. | 2016 |
1 | Rescuing Uncorrectable Fault Patterns in On-Chip Memories through Error Pattern Transformation. | 2016 |
1 | Asymmetry-Aware Work-Stealing Runtimes. | 2016 |
1 | XED: Exposing On-Die Error Detection Information for Strong Memory Reliability. | 2016 |
1 | All-Inclusive ECC: Thorough End-to-End Protection for Reliable Computer Memory. | 2016 |
0 | Message from the General Chair. | 2006 |
0 | Message from the Program Chair. | 2006 |
0 | SIGARCH Guidelines. | 2006 |
0 | LaZy superscalar. | 2015 |
0 | DRAF: A Low-Power DRAM-Based Reconfigurable Acceleration Fabric. | 2016 |
0 | Opportunistic Competition Overhead Reduction for Expediting Critical Section in NoC Based CMPs. | 2016 |
0 | Production-Run Software Failure Diagnosis via Adaptive Communication Tracking. | 2016 |
0 | RelaxFault Memory Repair. | 2016 |
0 | Evaluation of an Analog Accelerator for Linear Algebra. | 2016 |
0 | Base-Victim Compression: An Opportunistic Cache Compression Architecture. | 2016 |
2016
Cited by | Paper title |
---|---|
80 | EIE: Efficient Inference Engine on Compressed Deep Neural Network. |
32 | ISAAC: A Convolutional Neural Network Accelerator with In-Situ Analog Arithmetic in Crossbars. |
32 | Minerva: Enabling Low-Power, Highly-Accurate Deep Neural Network Accelerators. |
31 | PRIME: A Novel Processing-in-Memory Architecture for Neural Network Computation in ReRAM-Based Main Memory. |
24 | Eyeriss: A Spatial Architecture for Energy-Efficient Dataflow for Convolutional Neural Networks. |
18 | Cnvlutin: Ineffectual-Neuron-Free Deep Neural Network Computing. |
17 | Transparent Offloading and Mapping (TOM): Enabling Programmer-Transparent Near-Data Processing in GPU Systems. |
13 | Accelerating Dependent Cache Misses with an Enhanced Memory Controller. |
13 | Neurocube: A Programmable Digital Neuromorphic Architecture with High-Density 3D Memory. |
12 | Cambricon: An Instruction Set Architecture for Neural Networks. |
11 | RedEye: Analog ConvNet Image Sensor Architecture for Continuous Mobile Vision. |
8 | LaPerm: Locality Aware Scheduler for Dynamic Parallelism on GPUs. |
6 | Dynamo: Facebook’s Data Center-Wide Power Management System. |
6 | Warped-Slicer: Efficient Intra-SM Slicing through Dynamic Resource Partitioning for GPU Multiprogramming. |
6 | Agile Paging: Exceeding the Best of Nested and Shadow Paging. |
5 | Efficient Synonym Filtering and Scalable Delayed Translation for Hybrid Virtual Caching. |
5 | Automatic Generation of Efficient Accelerators for Reconfigurable Hardware. |
5 | Energy Efficient Architecture for Graph Analytics Accelerators. |
5 | MITTS: Memory Inter-arrival Time Traffic Shaping. |
5 | Biscuit: A Framework for Near-Data Processing of Big Data Workloads. |
5 | CASH: Supporting IaaS Customers with a Sub-core Configurable Architecture. |
4 | Back to the Future: Leveraging Belady’s Algorithm for Improved Cache Replacement. |
4 | Mellow Writes: Extending Lifetime in Resistive Memories through Selective Slow Write Backs. |
4 | Using Multiple Input, Multiple Output Formal Control to Maximize Resource Efficiency in Architectures. |
4 | ASIC Clouds: Specializing the Datacenter. |
4 | Virtual Thread: Maximizing Thread-Level Parallelism beyond GPU Scheduling Limit. |
3 | Treadmill: Attributing the Source of Tail Latency through Precise Load Testing and Statistical Inference. |
3 | Peak Efficiency Aware Scheduling for Highly Energy Proportional Servers. |
2 | Bit-Plane Compression: Transforming Data for Better Compression in Many-Core Architectures. |
2 | ARM Virtualization: Performance and Architectural Implications. |
2 | Exploiting Dynamic Timing Slack for Energy Efficiency in Ultra-Low-Power Embedded Systems. |
2 | PowerChop: Identifying and Managing Non-critical Units in Hybrid Processor Architectures. |
2 | Future Vector Microprocessor Extensions for Data Aggregations. |
2 | LAP: Loop-Block Aware Inclusion Properties for Energy-Efficient Asymmetric Last Level Caches. |
2 | Morpheus: Creating Application Objects Efficiently for Heterogeneous Computing. |
2 | Towards Statistical Guarantees in Controlling Quality Tradeoffs for Approximate Acceleration. |
2 | ActivePointers: A Case for Software Address Translation on GPUs. |
2 | The Anytime Automaton. |
1 | Accelerating Markov Random Field Inference Using Molecular Optical Gibbs Sampling Units. |
1 | Strober: Fast and Accurate Sample-Based Energy Simulation for Arbitrary RTL. |
1 | Decoupling Loads for Nano-Instruction Set Computers. |
1 | Efficiently Scaling Out-of-Order Cores for Simultaneous Multithreading. |
1 | Energy Efficient Data Encoding in DRAM Channels Exploiting Data Value Similarity. |
1 | Boosting Access Parallelism to PCM-Based Main Memory. |
1 | Power Attack Defense: Securing Battery-Backed Data Centers. |
1 | Short-Circuit Dispatch: Accelerating Virtual Machine Interpreters on Embedded Processors. |
1 | APRES: Improving Cache Efficiency by Exploiting Load Characteristics on GPUs. |
1 | Rescuing Uncorrectable Fault Patterns in On-Chip Memories through Error Pattern Transformation. |
1 | Asymmetry-Aware Work-Stealing Runtimes. |
1 | XED: Exposing On-Die Error Detection Information for Strong Memory Reliability. |
1 | All-Inclusive ECC: Thorough End-to-End Protection for Reliable Computer Memory. |
0 | DRAF: A Low-Power DRAM-Based Reconfigurable Acceleration Fabric. |
0 | Opportunistic Competition Overhead Reduction for Expediting Critical Section in NoC Based CMPs. |
0 | Production-Run Software Failure Diagnosis via Adaptive Communication Tracking. |
0 | RelaxFault Memory Repair. |
0 | Evaluation of an Analog Accelerator for Linear Algebra. |
0 | Base-Victim Compression: An Opportunistic Cache Compression Architecture. |
2015
Cited by | Paper title |
---|---|
98 | A scalable processing-in-memory accelerator for parallel graph processing. |
76 | ShiDianNao: shifting vision processing closer to the sensor. |
57 | Heracles: improving resource efficiency at scale. |
57 | PIM-enabled instructions: a low-overhead, locality-aware processing-in-memory architecture. |
53 | Profiling a warehouse-scale computer. |
32 | Data reorganization in memory using 3D-stacked DRAM. |
31 | Rumba: an online quality management system for approximate computing. |
29 | Architecting to achieve a billion requests per second throughput on a single key-value store server platform. |
27 | Redundant memory mappings for fast access to large memories. |
27 | DjiNN and Tonic: DNN as a service and its implications for future warehouse scale computers. |
26 | BlueDBM: an appliance for big data analytics. |
24 | Quantitative comparison of hardware transactional memory for Blue Gene/Q, zEnterprise EC12, Intel Core, and POWER8. |
24 | Warped-compression: enabling power efficient GPUs through register compression. |
22 | A case for core-assisted bottleneck acceleration in GPUs: enabling flexible data compression with assist warps. |
21 | Dynamic thread block launch: a lightweight execution mechanism to support irregular applications on GPUs. |
21 | CAWA: coordinated warp scheduling and cache prioritization for critical warp acceleration of GPGPU workloads. |
21 | A fully associative, tagless DRAM cache. |
20 | Flexible auto-refresh: enabling scalable and energy-efficient DRAM refresh reductions. |
20 | BEAR: techniques for mitigating bandwidth bloat in gigascale DRAM caches. |
19 | Exploring the potential of heterogeneous von neumann/dataflow execution models. |
18 | Stash: have your scratchpad and cache it too. |
16 | Towards sustainable in-situ server systems in the big data era. |
16 | HEB: deploying and managing hybrid energy buffers for improving datacenter efficiency and economy. |
15 | Multiple clone row DRAM: a low latency and area optimized DRAM. |
14 | Page overlays: an enhanced virtual memory framework to enable fine-grained memory management. |
14 | Hi-fi playback: tolerating position errors in shift operations of racetrack memory. |
13 | Manycore network interfaces for in-memory rack-scale computing. |
13 | Callback: efficient synchronization without invalidation with a directory just for spin-waiting. |
13 | ArMOR: defending against memory consistency model mismatches in heterogeneous architectures. |
12 | Fusion: design tradeoffs in coherent cache hierarchies for accelerators. |
12 | SLIP: reducing wire energy in the memory hierarchy. |
11 | Harmonia: balancing compute and memory power in high-performance GPUs. |
10 | Flexible software profiling of GPU architectures. |
9 | Thermal time shifting: leveraging phase change materials to reduce cooling costs in warehouse-scale computers. |
9 | Probable cause: the deanonymizing effects of approximate DRAM. |
9 | COP: to compress and protect main memory. |
8 | DynaSpAM: dynamic spatial architecture mapping using out of order instruction schedules. |
8 | PrORAM: dynamic prefetcher for oblivious RAM. |
8 | Computer performance microscopy with Shim. |
8 | CloudMonatt: an architecture for security health monitoring and attestation of virtual machines in cloud computing. |
7 | FASE: finding amplitude-modulated side-channel emanations. |
7 | The load slice core microarchitecture. |
6 | Accelerating asynchronous programs through event sneak peek. |
6 | Reducing world switches in virtualized environment with flexible cross-world calls. |
6 | Semantic locality and context-based prefetching using reinforcement learning. |
6 | Efficient execution of memory access phases using dataflow specialization. |
6 | Clean: a race detector with cleaner semantics. |
6 | A variable warp size architecture. |
6 | Coherence protocol for transparent management of scratchpad memories in shared memory manycore architectures. |
5 | Unified address translation for memory-mapped SSDs with FlashMap. |
4 | Branch vanguard: decomposing branch functionality into prediction and resolution instructions. |
4 | SHRINK: reducing the ISA complexity via instruction recycling. |
4 | MiSAR: minimalistic synchronization accelerator with resource overflow management. |
3 | MBus: an ultra-low power interconnect bus for next generation nanopower systems. |
3 | Cost-effective speculative scheduling in high performance processors. |
2 | VIP: virtualizing IP chains on handheld platforms. |
1 | FaultHound: value-locality-based soft-fault tolerance. |
0 | LaZy superscalar. |
2014
Cited by | Paper title |
---|---|
353 | A reconfigurable fabric for accelerating large-scale datacenter services. |
131 | Flipping bits in memory without accessing them: An experimental study of DRAM disturbance errors. |
101 | Towards energy proportionality for large-scale latency-critical workloads. |
95 | General-purpose code acceleration with limited-precision analog computation. |
88 | Memory persistency. |
74 | Aladdin: A pre-RTL, power-performance accelerator simulator enabling large design space exploration of customized architectures. |
58 | The CHERI capability model: Revisiting RISC in an age of risk. |
50 | Enabling preemptive multiprogramming on GPUs. |
39 | SCORPIO: A 36-core research chip demonstrating snoopy coherence on a scalable mesh NoC with in-network ordering. |
37 | SC2: A statistical compression cache scheme. |
36 | WebCore: Architectural support for mobile Web browsing. |
36 | SynFull: Synthetic traffic models capturing cache coherent behaviour. |
35 | Harnessing ISA diversity: Design of a heterogeneous-ISA chip multiprocessor. |
33 | STAG: Spintronic-Tape Architecture for GPGPU cache hierarchies. |
33 | Half-DRAM: A high-bandwidth and low-power DRAM architecture from the rethinking of fine-grained activation. |
29 | The Dirty-Block Index. |
26 | Unifying on-chip and inter-node switching within the Anton 2 network. |
25 | Architecture implications of pads as a scarce resource. |
22 | Eliminating redundant fragment shader executions on a mobile GPU via hardware memoization. |
21 | Going vertical in memory management: Handling multiplicity by multi-policy. |
21 | SleepScale: Runtime joint speed scaling and sleep states management for power efficient data centers. |
19 | Single-graph multiple flows: Energy efficient design alternative for GPGPUs. |
18 | HELIX-RC: An architecture-compiler co-design for automatic parallelization of irregular programs. |
17 | Optimizing virtual machine consolidation performance on NUMA server architecture for cloud workloads. |
17 | HIOS: A host interface I/O scheduler for Solid State Disks. |
15 | CODOMs: Protecting software with Code-centric memory Domains. |
15 | Real-world design and evaluation of compiler-managed GPU redundant multithreading. |
15 | EOLE: Paving the way for an effective implementation of value prediction. |
13 | ArchRanker: A ranking approach to design space exploration. |
13 | Fine-grain task aggregation and coordination on GPUs. |
13 | GangES: Gang error simulation for hardware resiliency evaluation. |
11 | OmniOrder: Directory-based conflict serialization of transactions. |
10 | Increasing off-chip bandwidth in multi-core processors with switchable pins. |
10 | Race Logic: A hardware acceleration for dynamic programming algorithms. |
9 | Replay debugging: Leveraging record and replay for program debugging. |
9 | Navigating the cache hierarchy with a single lookup. |
9 | An examination of the architecture and system-level tradeoffs of employing steep slope devices in 3D CMPs. |
9 | Avoiding core’s DUE & SDC via acoustic wave detectors and tailored error containment and recovery. |
8 | A low power and reliable charge pump design for Phase Change Memories. |
8 | Improving the energy efficiency of Big Cores. |
8 | Row-buffer decoupling: A case for low-latency DRAM microarchitecture. |
8 | Reducing access latency of MLC PCMs through line striping. |
4 | MemGuard: A low cost and energy efficient design to support and enhance memory system reliability. |
4 | Fractal++: Closing the performance gap between fractal and conventional coherence. |
3 | Pacifier: Record and replay for relaxed-consistency multiprocessors with distributed directory protocol. |
1 | Efficient digital neurons for large scale cortical architectures. |
2013
Cited by | Paper title |
---|---|
272 | GPUWattch: enabling energy optimizations in GPGPUs. |
143 | ZSim: fast and accurate microarchitectural simulation of thousand-core systems. |
123 | Bubble-flux: precise online QoS management for increased utilization in warehouse scale computers. |
121 | Thin servers with smart pipes: designing SoC accelerators for memcached. |
119 | An experimental study of data retention behavior in modern DRAM devices: implications for retention time profiling mechanisms. |
107 | Efficient virtual memory for big memory servers. |
105 | Convolution engine: balancing efficiency & flexibility in specialized computing. |
96 | Orchestrated scheduling and prefetching for GPGPUs. |
95 | Robust architectural support for transactional memory in the power architecture. |
94 | Die-stacked DRAM caches for servers: hit ratio, latency, or bandwidth? have it all with footprint cache. |
84 | Catnap: energy proportional multiple network-on-chip. |
83 | On the feasibility of online malware detection with performance counters. |
66 | Navigating big data with high-throughput, energy-efficient data partitioning. |
64 | Whare-map: heterogeneity in “homogeneous” warehouse-scale computers. |
64 | Design space exploration and optimization of path oblivious RAM in secure processors. |
60 | ArchShield: architectural framework for assisting DRAM scaling by tolerating high error rates. |
60 | A hardware evaluation of cache partitioning to improve utilization and energy-efficiency while preserving responsiveness. |
58 | Triggered instructions: a control paradigm for spatially-programmed architectures. |
54 | LINQits: big data on little clients. |
52 | SurfNoC: a low latency and provably non-interfering approach to secure networks-on-chip. |
51 | Tri-level-cell phase change memory: toward an efficient and reliable memory system. |
50 | Reducing memory access latency with asymmetric DRAM bank organizations. |
49 | Utility-based acceleration of multithreaded applications on asymmetric CMPs. |
48 | Continuous real-world inputs can open up alternative accelerator designs. |
42 | Understanding and mitigating refresh overheads in high-density DDR4 DRAM systems. |
41 | A new perspective for efficient virtual-cache coherence. |
41 | Flicker: a dynamically adaptive architecture for power limited multicore systems. |
41 | An energy-efficient and scalable eDRAM-based register file architecture for GPGPU. |
39 | AC-DIMM: associative computing with STT-MRAM. |
38 | Criticality stacks: identifying critical threads in parallel programs using synchronization behavior. |
35 | Improving memory scheduling via processor-side load criticality information. |
33 | Exploring memory consistency for massively-threaded throughput-oriented processors. |
32 | The locality-aware adaptive cache coherence protocol. |
32 | Resilient die-stacked DRAM caches. |
31 | Zombie memory: extending memory lifetime by reviving dead blocks. |
30 | Cooperative boosting: needy versus greedy power management. |
29 | Virtualizing power distribution in datacenters. |
27 | Agile, efficient virtualization power management with low-latency server power states. |
26 | DNA-based molecular architecture with spatially localized components. |
25 | Protozoa: adaptive granularity cache coherence. |
25 | QuickSAN: a storage area network for fast, distributed, solid state disks. |
24 | Studying multicore processor scaling via reuse distance analysis. |
23 | SIMD divergence optimization through intra-warp compaction. |
23 | Maximizing SIMD resource utilization in GPGPUs with SIMD lane permutation. |
22 | Bit mapping for balanced PCM cell programming. |
22 | Dynamic reduction of voltage margins by leveraging on-chip ECC in Itanium II processors. |
21 | WeeFence: toward making fences free in TSO. |
19 | QuickRec: prototyping an Intel architecture extension for record and replay of multithreaded programs. |
16 | CPU transparent protection of OS kernel and hypervisor integrity with programmable DRAM. |
14 | STREX: boosting instruction cache reuse in OLTP workloads through stratified transaction execution. |
14 | Secure I/O device sharing among virtual machines on multiple hosts. |
11 | Microarchitectural mechanisms to exploit value structure in SIMT architectures. |
10 | Improving virtualization in the presence of software managed translation lookaside buffers. |
7 | Non-race concurrency bug detection through order-sensitive critical sections. |
5 | Quantum rotations: a case study in static and dynamic machine-code generation for quantum computers. |
2 | Deconfigurable microprocessor architectures for silicon debug acceleration. |
2012¶
Cited by | Paper title |
---|---|
236 | RAIDR: Retention-aware intelligent DRAM refresh. |
205 | Scheduling heterogeneous multi-cores through performance impact estimation (PIE). |
145 | Staged memory scheduling: Achieving high performance and scalability in heterogeneous systems. |
144 | Towards energy-proportional datacenter memory with mobile DRAM. |
140 | A case for exploiting subarray-level parallelism (SALP) in DRAM. |
136 | Scale-out processors. |
118 | Managing distributed UPS energy for effective power capping in data centers. |
106 | PreSET: Improving performance of phase change memories by exploiting asymmetry in write times. |
82 | A defect-tolerant accelerator for emerging high-performance applications. |
80 | iSwitch: Coordinating and optimizing renewable energy powered server clusters. |
75 | Can traditional programming bridge the Ninja performance gap for parallel computing applications? |
74 | Simultaneous branch and warp interweaving for sustained GPU performance. |
73 | TimeWarp: Rethinking timekeeping and performance monitoring mechanisms to mitigate side-channel attacks. |
73 | The Yin and Yang of power and performance for asymmetric hardware and managed software. |
65 | End-to-end sequential consistency. |
64 | A case for random shortcut topologies for HPC interconnects. |
61 | LOT-ECC: Localized and tiered reliability mechanisms for commodity memory systems. |
59 | Probabilistic Shared Cache Management (PriSM). |
58 | Branch regulation: Low-overhead protection from code reuse attacks. |
56 | Side-channel vulnerability factor: A metric for measuring information leakage. |
52 | Reducing memory reference energy with opportunistic virtual caching. |
49 | CAPRI: Prediction of compaction-adequacy for handling control-divergence in GPGPU architectures. |
47 | The dynamic granularity memory system. |
46 | Physically Addressed Queueing (PAQ): Improving parallelism in Solid State Disks. |
45 | iGPU: Exception support and speculative execution on GPUs. |
42 | RADISH: Always-on sound and complete race detection in software and hardware. |
40 | BOOM: Enabling mobile memory based low-power server DIMMs. |
36 | Buffer-on-board memory systems. |
35 | Watchdog: Hardware for safe and secure manual memory management and full memory safety. |
32 | Tolerating process variations in nanophotonic on-chip networks. |
32 | FLEXclusion: Balancing cache capacity and on-chip bandwidth via Flexible Exclusion. |
29 | Harmony: Collection and analysis of parallel block vectors. |
29 | PARDIS: A programmable memory controller for the DDRx interfacing standards. |
27 | Revisiting hardware-assisted page walks for virtualized systems. |
27 | A first-order mechanistic model for architectural vulnerability factor. |
27 | A micro-architectural analysis of switched photonic multi-chip interconnects. |
25 | VRSync: Characterizing and eliminating synchronization-induced voltage emergencies in many-core processors. |
24 | Boosting mobile GPU performance with a decoupled access/execute fragment processor. |
23 | Improving writeback efficiency with decoupled last-write prediction. |
23 | Lane decoupling for improving the timing-error resiliency of wide-SIMD architectures. |
19 | Inspection resistant memory: Architectural support for security from physical examination. |
14 | Viper: Virtual pipelines for enhanced reliability. |
14 | Enhancing effective throughput for transmission line-based bus. |
8 | Setting an error detection infrastructure with low cost acoustic wave detectors. |
5 | Euripus: A flexible unified hardware memory checkpointing accelerator for bidirectional-debugging and reliability. |
5 | Configurable fine-grain protection for multicore processor virtualization. |
3 | BlockChop: Dynamic squash elimination for hybrid processor architecture. |
2011¶
Cited by | Paper title |
---|---|
1203 | Dark silicon and the end of multicore scaling. |
295 | Power management of online data-intensive services. |
175 | Energy-efficient mechanisms for managing thread context in throughput processors. |
172 | The impact of memory subsystem resource sharing on datacenter applications. |
160 | Benefits and limitations of tapping into stored energy for datacenters. |
157 | Vantage: scalable and efficient fine-grain cache partitioning. |
134 | Kilo-NOC: a heterogeneous network-on-chip architecture for scalability and service guarantees. |
132 | Increasing the effectiveness of directory caches by deactivating coherence for private memory blocks. |
120 | FabScalar: composing synthesizable RTL designs of arbitrary cores within a canonical superscalar template. |
114 | DBAR: an efficient routing algorithm to support multiple concurrent applications in networks-on-chip. |
111 | Energy-efficient cache design using variable-strength error-correcting codes. |
98 | Scalable power control for many-core architectures running multi-threaded applications. |
98 | Prefetch-aware shared resource management for multi-core systems. |
97 | Exploring the tradeoffs between programmability and efficiency in data-parallel accelerators. |
88 | A case for heterogeneous on-chip interconnects for CMPs. |
79 | An abacus turn model for time/space-efficient reconfigurable routing. |
74 | Architecting on-chip interconnects for stacked 3D STT-RAM caches in CMPs. |
74 | Bypass and insertion algorithms for exclusive last-level caches. |
69 | Crafting a usable microkernel, processor, and I/O system with strict and provable information flow security. |
66 | Adaptive granularity memory systems: a tradeoff between storage efficiency and throughput. |
66 | The role of optics in future high radix switch design. |
65 | Combining memory and a controller with photonics through 3D-stacking to enable scalable and energy-efficient systems. |
60 | SRAM-DRAM hybrid memory with applications to efficient register files in fine-grained multi-threading. |
59 | Rapid identification of architectural bottlenecks via precise event counting. |
52 | SpecTLB: a mechanism for speculative address translation. |
51 | Sampling + DMR: practical and low-overhead permanent fault detection. |
47 | i-NVMM: a secure non-volatile main memory system with incremental encryption. |
39 | CPPC: correctable parity protected cache. |
38 | Automatic abstraction and fault tolerance in cortical microarchitectures. |
36 | TLSync: support for multiple fast barriers using on-chip transmission lines. |
36 | Fighting fire with fire: modeling the datacenter-scale effects of targeted superlattice thermal management. |
34 | Virtualizing performance asymmetric multi-core systems. |
33 | Demand-driven software race detection using hardware performance counters. |
31 | A case for globally shared-medium on-chip interconnect. |
25 | Rebound: scalable checkpointing for coherent shared memory. |
21 | Moguls: a model to explore the memory hierarchy for bandwidth improvements. |
19 | OUTRIDER: efficient memory latency tolerance with decoupled strands. |
13 | CRIB: consolidated rename, issue, and bypass. |
7 | FlexBulk: intelligently forming atomic blocks in blocked-execution multiprocessors to minimize squashes. |
3 | Releasing efficient beta cores to market early. |
2010¶
Cited by | Paper title |
---|---|
756 | Debunking the 100X GPU vs. CPU myth: an evaluation of throughput computing on CPU and GPU. |
413 | High performance cache replacement using re-reference interval prediction (RRIP). |
389 | An integrated GPU power and performance model. |
361 | Energy proportional datacenter networks. |
318 | Understanding sources of inefficiency in general-purpose chips. |
254 | NoHype: virtualized cloud infrastructure without the virtualization. |
232 | Use ECP, not ECC, for hard failures in resistive memories. |
211 | Rethinking DRAM design and organization for energy-constrained multi-cores. |
195 | Relax: an architectural framework for software recovery of hardware faults. |
189 | Security refresh: prevent malicious wear-out and increase durability for phase-change memory with dynamically randomized address mapping. |
186 | Dynamic warp subdivision for integrated branch and memory divergence tolerance. |
173 | Reducing cache power with low-cost, multi-bit error-correcting codes. |
172 | Web search using mobile cores: quantifying and mitigating the price of efficiency. |
143 | Morphable memory system: a robust architecture for exploiting multi-level phase change memories. |
141 | Resistive computation: avoiding the power wall with low-leakage, STT-MRAM based computing. |
139 | Aérgia: exploiting packet latency slack in on-chip networks. |
124 | A dynamically configurable coprocessor for convolutional neural networks. |
124 | Energy-performance tradeoffs in processor architecture and circuit design: a marginal cost analysis. |
110 | Translation caching: skip, don’t walk (the page table). |
100 | Re-architecting DRAM memory systems with monolithically integrated silicon photonics. |
99 | The impact of management operations on the virtualized datacenter. |
99 | SieveStore: a highly-selective, ensemble-level disk cache for cost-performance. |
94 | Modeling critical sections in Amdahl’s law and its implications for multicore design. |
89 | Conflict exceptions: simplifying concurrent language semantics with precise hardware exceptions for data-races. |
88 | The virtual write queue: coordinating DRAM and last-level cache policies. |
86 | Silicon-photonic network architectures for scalable, power-efficient multi-chip systems. |
84 | Evolution of thread-level parallelism in desktop applications. |
77 | An intra-chip free-space optical interconnect. |
73 | Thread tailor: dynamically weaving threads together for efficient, adaptive parallel applications. |
68 | Elastic cooperative caching: an autonomous dynamically adaptive memory hierarchy for chip multiprocessors. |
65 | A case for FAME: FPGA architecture model execution. |
60 | ColorSafe: architectural support for debugging and dynamically avoiding multi-variable atomicity violations. |
55 | Cohesion: a hybrid memory model for accelerators. |
52 | WiDGET: Wisconsin decoupled grid execution tiles. |
42 | Using hardware vulnerability factors to enhance AVF analysis. |
39 | Forwardflow: a scalable core for power-constrained CMPs. |
39 | RETCON: transactional repair without replay. |
38 | LReplay: a pending period based deterministic replay scheme. |
31 | Data marshaling for multi-core architectures. |
29 | Timetraveler: exploiting acyclic races for optimizing memory race recording. |
20 | Leveraging the core-level complementary effects of PVT variations to reduce timing emergencies in multi-core processors. |
17 | Necromancer: enhancing system throughput by animating dead cores. |
15 | Sentry: light-weight auxiliary memory access control. |
13 | The rebirth of neural networks. |
7 | Moving the needle, computer architecture research in academe and industry. |
5 | Improving the future by examining the past. |
4 | IVEC: off-chip memory integrity protection for both security and reliability. |
2 | Shared caches in multicores: the good, the bad, and the ugly. |
2009¶
Cited by | Paper title |
---|---|
937 | Scalable high performance main memory system using phase-change memory technology. |
875 | Architecting phase change memory as a scalable DRAM alternative. |
644 | A durable and energy efficient main memory using phase change memory technology. |
532 | An analytical model for a GPU architecture with memory-level and thread-level parallelism awareness. |
380 | Reactive NUCA: near-optimal block placement and replication in distributed caches. |
337 | Firefly: illuminating future network-on-chip with nanophotonics. |
316 | A case for bufferless routing in on-chip networks. |
306 | Hybrid cache architecture with disparate memory technologies. |
267 | PIPP: promotion/insertion pseudo-partitioning of multi-core shared caches. |
238 | Hardware support for WCET analysis of hard real-time multicore systems. |
236 | Scaling the bandwidth wall: challenges in and avenues for CMP scaling. |
225 | Thread motion: fine-grained power management for multi-core systems. |
190 | The performance of PC solid-state disks (SSDs) as a function of bandwidth, concurrency, device architecture, and system organization. |
174 | Temperature-constrained power control for chip multiprocessors with online model estimation. |
170 | Thread criticality predictors for dynamic performance, power, and resource management in chip multiprocessors. |
169 | Phastlane: a rapid transit optical routing network. |
159 | Rigel: an architecture and scalable programming interface for a 1000-core accelerator. |
154 | Disaggregated memory for expansion and sharing in blade servers. |
137 | Achieving predictable performance through better memory controller placement in many-core CMPs. |
137 | A case for an interleaving constrained shared-memory multi-processor. |
124 | Architectural core salvaging in a multi-core processor for hard-error tolerance. |
106 | Memory mapped ECC: low-cost error protection for last level caches. |
100 | SigRace: signature-based data race detection. |
96 | Spatio-temporal memory streaming. |
94 | AnySP: anytime anywhere anyway signal processing. |
90 | InvisiFence: performance-transparent memory ordering in conventional multiprocessors. |
74 | Application-aware deadlock-free oblivious routing. |
70 | Decoupled DIMM: building high-bandwidth memory system using low-speed DRAM devices. |
66 | Indirect adaptive routing on large scale interconnection networks. |
62 | Internet-scale service infrastructure efficiency. |
60 | Simultaneous speculative threading: a novel pipeline architecture implemented in Sun’s Rock processor. |
46 | Stream chaining: exploiting multiple levels of correlation in data prefetching. |
44 | Multi-execution: multicore caching for data-similar executions. |
41 | Performance and power of cache-based reconfigurable computing. |
40 | A fault tolerant, area efficient architecture for Shor’s factoring algorithm. |
40 | Dynamic performance tuning for speculative threads. |
36 | ECMon: exposing cache events for monitoring. |
29 | Boosting single-thread performance in multi-core systems through fine-grain multi-threading. |
25 | Flexible reference-counting-based hardware acceleration for garbage collection. |
25 | A memory system design framework: creating smart memories. |
21 | Dynamic MIPS rate stabilization in out-of-order processors. |
19 | End-to-end register data-flow continuous self-test. |
14 | Ten ways to waste a parallel computer. |
11 | End-to-end performance forecasting: finding bottlenecks before they happen. |
10 | Decoupled store completion/silent deterministic replay: enabling scalable data memory for CPR/CFP processors. |
2008¶
Cited by | Paper title |
---|---|
625 | Corona: System Implications of Emerging Nanophotonic Technology. |
588 | 3D-Stacked Memory Architectures for Multi-core Processors. |
448 | Parallelism-Aware Batch Scheduling: Enhancing both Performance and Fairness of Shared DRAM Systems. |
367 | Technology-Driven, Highly-Scalable Dragonfly Topology. |
292 | Variation-Aware Application Scheduling and Power Management for Chip Multiprocessors. |
290 | Improving NAND Flash Based Disk Caches. |
288 | Self-Optimizing Memory Controllers: A Reinforcement Learning Approach. |
225 | Trading off Cache Capacity for Reliability to Enable Low Voltage Operation. |
222 | Virtual Circuit Tree Multicasting: A Case for On-Chip Hardware Multicast Support. |
221 | MIRA: A Multi-layered On-Chip Interconnect Router Architecture. |
217 | DeLorean: Recording and Deterministically Replaying Shared-Memory Multiprocessor Execution Efficiently. |
217 | A Comprehensive Memory Modeling Tool and Its Application to the Design and Analysis of Future Memory Hierarchies. |
216 | Understanding and Designing New Server Architectures for Emerging Warehouse-Computing Environments. |
179 | Rerun: Exploiting Episodes for Lightweight Memory Race Recording. |
160 | Flexible Decoupled Transactional Memory Support. |
150 | Globally-Synchronized Frames for Guaranteed Quality-of-Service in On-Chip Networks. |
134 | TokenTM: Efficient Execution of Large Transactions with Hardware Transactional Memory. |
130 | Flexible Hardware Acceleration for Instruction-Grain Program Monitoring. |
118 | Atom-Aid: Detecting and Surviving Atomicity Violations. |
113 | Using Hardware Memory Protection to Build a High-Performance, Strongly-Atomic Hybrid Transactional Memory. |
113 | VEAL: Virtualized Execution Accelerator for Loops. |
91 | A Proactive Wearout Recovery Approach for Exploiting Microarchitectural Redundancy to Extend Cache SRAM Lifetime. |
78 | ReVIVaL: A Variation-Tolerant Architecture Using Voltage Interpolation and Variable Latency. |
75 | Online Estimation of Architectural Vulnerability Factor for Soft Errors. |
71 | iDEAL: Inter-router Dual-Function Energy and Area-Efficient Links for Network-on-Chip (NoC) Architectures. |
69 | Polymorphic On-Chip Networks. |
45 | Learning and Leveraging the Relationship between Architecture-Level Measurements and Individual User Satisfaction. |
40 | Achieving Out-of-Order Performance with Almost In-Order Complexity. |
39 | From Speculation to Security: Practical and Efficient Information Flow Tracking Using Speculative Hardware. |
38 | Atomic Vector Operations on Chip Multiprocessors. |
37 | Software-Controlled Priority Characterization of POWER5 Processor. |
35 | Running a Quantum Circuit at the Speed of Data. |
33 | Intra-disk Parallelism: An Idea Whose Time Has Come. |
15 | Counting Dependence Predictors. |
15 | Microcoded Architectures for Ion-Trap Quantum Computers. |
9 | A Two-Level Load/Store Queue Based on Execution Locality. |
6 | Fetch-Criticality Reduction through Control Independence. |
2007¶
Cited by | Paper title |
---|---|
1553 | Power provisioning for a warehouse-sized computer. |
547 | Adaptive insertion policies for high performance caching. |
451 | Anton, a special-purpose machine for molecular dynamics simulation. |
368 | Express virtual channels: towards the ideal interconnection fabric. |
365 | An effective hybrid transactional memory system with strong isolation guarantees. |
363 | Flattened butterfly: a cost-efficient topology for high-radix networks. |
294 | Core fusion: accommodating software diversity in chip multiprocessors. |
285 | Raksha: a flexible information flow architecture for software security. |
265 | A novel dimensionally-decomposed router for on-chip communication in 3D architectures. |
263 | New cache designs for thwarting software cache-based side channel attacks. |
258 | Performance pathologies in hardware transactional memory. |
225 | Carbon: architectural support for fine-grained parallelism on chip multiprocessors. |
223 | BulkSC: bulk enforcement of sequential consistency. |
207 | Virtual hierarchies to support server consolidation. |
192 | Configurable isolation: building high availability systems with commodity multi-core processors. |
192 | Virtual private caches. |
162 | Making the fast case common and the uncommon case simple in unbounded transactional memory. |
157 | Analysis of redundancy and application balance in the SPEC CPU2006 benchmark suite. |
153 | ReCycle: pipeline adaptation to tolerate process variation. |
138 | An integrated hardware-software approach to flexible transactional memory. |
137 | Limiting the power consumption of main memory. |
127 | Comparing memory systems for chip multiprocessors. |
125 | Dynamic prediction of architectural vulnerability from microarchitectural state. |
109 | Interconnect design considerations for large NUCA caches. |
103 | Examining ACE analysis reliability estimates using fault-injection. |
99 | Mechanisms for store-wait-free multiprocessors. |
96 | MetaTM/TxLinux: transactional memory for an operating system. |
90 | Synchronization state buffer: supporting efficient fine-grain synchronization on many-core architectures. |
88 | Rotary router: an efficient architecture for CMP interconnection networks. |
82 | Hardware atomicity for reliable software speculation. |
76 | Power model validation through thermal measurements. |
72 | A 64-bit stream processor architecture for scientific applications. |
69 | Automated design of application specific superscalar processors: an analytical approach. |
64 | Mechanisms for bounding vulnerabilities of processor structures. |
57 | Thermal modeling and management of DRAM memory systems. |
48 | Physical simulation for animation and visual effects: parallelization and characterization for chip multiprocessors. |
47 | ParallAX: an architecture for real-time physics. |
45 | Late-binding: enabling unordered load-store queues. |
43 | VPC prediction: reducing the cost of indirect branches via hardware-based dynamic devirtualization. |
39 | Matrix scheduler reloaded. |
36 | Transparent control independence (TCI). |
29 | Aquacore: a programmable architecture for microfluidics. |
25 | Ginger: control independence using tag rewriting. |
13 | Performance and security lessons learned from virtualizing the Alpha processor. |
11 | Tailoring quantum architectures to implementation style: a quantum computer for mobile and persistent qubits. |
10 | Architectural implications of brick and mortar silicon manufacturing. |
2006¶
Cited by | Paper title |
---|---|
539 | Techniques for Multicore Thermal Management: Classification and New Exploration. |
477 | Cooperative Caching for Chip Multiprocessors. |
427 | Design and Management of 3D Chip Multiprocessors Using Network-in-Memory. |
376 | Ensemble-level Power Management for Dense Blade Servers. |
327 | Bulk Disambiguation of Speculative Threads in Multiprocessors. |
289 | A Scalable Architecture For High-Throughput Regular-Expression Pattern Matching. |
267 | A Case for MLP-Aware Cache Replacement. |
253 | SODA: A Low-power Architecture For Software Radio. |
229 | A Gracefully Degrading and Energy-Efficient Modular Router Architecture for On-Chip Networks. |
217 | Architectural Semantics for Practical Transactional Memory. |
208 | The BlackWidow High-Radix Clos Network. |
165 | Spatial Memory Streaming. |
139 | Improving Cost, Performance, and Security of Memory Encryption and Authentication. |
131 | Interconnect-Aware Coherence Protocols for Chip Multiprocessors. |
120 | Learning-Based SMT Processor Resource Distribution via Hill-Climbing. |
120 | TRAP-Array: A Disk Array Architecture Providing Timely Recovery to Any Point-in-time. |
70 | Chisel: A Storage-efficient, Collision-free Hash-based Network Processing Architecture. |
68 | Program Demultiplexing: Data-flow based Speculative Parallelization of Methods in Sequential Programs. |
62 | Balanced Cache: Reducing Conflict Misses of Direct-Mapped Caches. |
60 | An Integrated Framework for Dependable and Revivable Architectures Using Multicore Processors. |
59 | Flexible Snooping: Adaptive Forwarding and Filtering of Snoops in Embedded-Ring Multiprocessors. |
54 | Memory Model = Instruction Reordering + Store Atomicity. |
51 | Quantum Memory Hierarchies: Efficient Designs to Match Available Parallelism in Quantum Computing. |
44 | Multiple Instruction Stream Processor. |
42 | Area-Performance Trade-offs in Tiled Dataflow Architectures. |
39 | Reducing Startup Time in Co-Designed Virtual Machines. |
33 | Tolerating Dependences Between Large Speculative Threads Via Sub-Threads. |
30 | Conditional Memory Ordering. |
30 | Interconnection Networks for Scalable Quantum Computers. |
24 | Slackened Memory Dependence Enforcement: Combining Opportunistic Forwarding with Decoupled Verification. |
21 | Distributed Arithmetic on a Quantum Multicomputer. |
17 | The Future of Virtualization Technology. |
6 | The End of Scaling? Revolutions in Technology and Microarchitecture as We Pass the 90 Nanometer Node. |
1 | Computer Architecture Research and Future Microprocessors: Where Do We Go from Here? |
0 | Message from the General Chair. |
0 | Message from the Program Chair. |
0 | SIGARCH Guidelines. |
2005¶
Cited by | Paper title |
---|---|
610 | Continuous Optimization. |
511 | Interconnections in Multi-Core Architectures: Understanding Mechanisms, Overheads and Scaling. |
497 | Virtualizing Transactional Memory. |
411 | Victim Replication: Maximizing Capacity while Hiding Wire Delay in Tiled Chip Multiprocessors. |
341 | BugNet: Continuously Recording Program Execution for Deterministic Replay Debugging. |
341 | Optimizing Replication, Communication, and Capacity Allocation in CMPs. |
334 | A High Throughput String Matching Architecture for Intrusion Detection and Prevention. |
305 | The Impact of Performance Asymmetry in Emerging Multicore Architectures. |
263 | Mitigating Amdahl’s Law through EPI Throttling. |
232 | Microarchitecture of a High-Radix Router. |
230 | Design and Implementation of the AEGIS Single-Chip Secure Processor Using Physical Random Functions. |
227 | Exploiting Structural Duplication for Lifetime Reliability Enhancement. |
220 | The V-Way Cache: Demand Based Associativity via Global Replacement. |
215 | Computing Architectural Vulnerability Factors for Address-Based Structures. |
211 | Architecture for Protecting Critical Secrets in Microprocessors. |
195 | Near-Optimal Worst-Case Throughput Routing for Two-Dimensional Mesh Networks. |
195 | An Ultra Low Power System Architecture for Sensor Network Applications. |
191 | RegionScout: Exploiting Coarse Grain Sharing in Snoop-Based Coherence. |
178 | An Architecture Framework for Transparent Instruction Set Customization in Embedded Processors. |
150 | Design and Evaluation of Hybrid Fault-Detection Systems. |
148 | Opportunistic Transient-Fault Detection. |
148 | Direct Cache Access for High Bandwidth Network I/O. |
125 | A Robust Main-Memory Compression Scheme. |
124 | Improving Multiprocessor Performance with Coarse-Grain Coherence Tracking. |
122 | Temporal Streaming of Shared Memory. |
118 | Analysis of the O-GEometric History Length Branch Predictor. |
109 | Energy Optimization of Subthreshold-Voltage Sensor Network Processors. |
101 | Adaptive Mechanisms and Policies for Managing Cache Hierarchies in Chip Multiprocessors. |
99 | High Efficiency Counter Mode Security Architecture via Prediction and Precomputation. |
98 | A Tree Based Router Search Engine Architecture with Single Port Memories. |
95 | Piecewise Linear Branch Prediction. |
82 | Disk Drive Roadmap from the Thermal Perspective: A Case for Dynamic Thermal Management. |
80 | Store Vulnerability Window (SVW): Re-Execution Filtering for Enhanced Load Optimization. |
76 | Rescue: A Microarchitecture for Testability and Defect Tolerance. |
70 | Scalable Load and Store Processing in Latency Tolerant Processors. |
69 | Techniques for Efficient Processing in Runahead Execution Engines. |
64 | An Integrated Memory Array Processor Architecture for Embedded Image Recognition Systems. |
59 | Deconstructing Commodity Storage Clusters. |
53 | An Evaluation Framework and Instruction Set Architecture for Ion-Trap Based Quantum Micro-Architectures. |
46 | RENO - A Rename-Based Instruction Optimizer. |
44 | Store Buffer Design in First-Level Multibanked Data Caches. |
42 | Improving Program Efficiency by Packing Instructions into Registers. |
35 | Dynamic Verification of Sequential Consistency. |
9 | Increased Scalability and Power Efficiency by Using Multiple Speed Pipelines. |
7 | Energy-Effectiveness of Pre-Execution and Energy-Aware P-Thread Selection. |