Riya Bisht

Research Paper Readings

  1. Basics of Reconfigurable Computing - Reiner Hartenstein, TU Kaiserslautern

The personal supercomputer is near. The Munich-based PACT Corp. [193], with its coarse-grained reconfigurable XPA (Xtreme Processing Array) product (also see fig. 30), has demonstrated that a 56-core 16-bit rDPA running at less than 500 MHz can simultaneously host everything needed for a world TV controller: multiple standards, all types of conversions, (de)compaction, image improvement and repair, all screen sizes and technologies, and all kinds of communication, including wireless. The result is higher performance with fewer CPUs, by using reconfigurable units in place of CPUs.

  1. Efficient Processing of Deep Neural Networks: A Tutorial and Survey (Sat, Jan 18)

Due to the popularity of DNNs, many recent hardware platforms have special features that target DNN processing. For instance, the Intel Knights Landing CPU features special vector instructions for deep learning, and the Nvidia Pascal GP100 GPU features 16-bit floating-point (FP16) arithmetic support to perform two FP16 operations on a single-precision core for faster deep learning computation. Systems have also been built specifically for DNN processing, such as the Nvidia DGX-1 and Facebook's Big Basin custom DNN server [73]. DNN inference has also been demonstrated on various embedded systems-on-chip (SoCs), such as Nvidia Tegra and Samsung Exynos, as well as on FPGAs. Accordingly, it's important to have a good understanding of how the processing is performed on these platforms, and how application-specific accelerators can be designed for DNNs to further improve throughput and energy efficiency.
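The reduced-precision idea above can be illustrated in a few lines. This is a hypothetical NumPy sketch (not code from the survey): it casts a layer's weights and activations to FP16, as hardware with native half-precision support would, and checks that the result stays close to the FP32 baseline while halving the bytes moved per value.

```python
import numpy as np

# Hypothetical example: a single fully connected layer y = W @ x,
# evaluated in FP32 and again with inputs cast to FP16.
rng = np.random.default_rng(0)
W = rng.standard_normal((64, 64)).astype(np.float32)  # weights
x = rng.standard_normal(64).astype(np.float32)        # activations

y_fp32 = W @ x
# Cast to half precision, multiply, and cast the result back for comparison.
y_fp16 = (W.astype(np.float16) @ x.astype(np.float16)).astype(np.float32)

# FP16 uses half the memory traffic of FP32 per element...
print(W.astype(np.float16).nbytes, "bytes vs", W.nbytes, "bytes")
# ...while the output remains close to the FP32 result for well-scaled data.
print(np.max(np.abs(y_fp32 - y_fp16)))
```

On GPUs the win is larger than this sketch suggests, since packed-FP16 units execute two half-precision operations per single-precision lane in addition to the bandwidth savings.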