## papers

These are my notes from research papers I read. Each page’s title is also a link to the abstract or PDF.

to read:

- Google AI: optimizing multiple loss functions
- Google AI: reducing gender bias in Google Translate
- Zoom In: An Introduction to Circuits
- Google AI: Neural Tangents
- Google AI: TensorFlow Quantum
- SLIDE (fast CPU training)
- Google AI: Reformer
- lottery ticket initialization
- Google AI: out-of-distribution detection
- Large-Scale Multilingual Speech Recognition with E2E model
- E2E ASR from raw waveform
- Machine Theory of Mind
- Normalizing Flows
- Glow networks
- A Theory of Local Learning, the Learning Channel, and the Optimality of Backpropagation
- Why and When Deep Networks Avoid the Curse of Dimensionality
- Diversity is All You Need (Learning Skills without a Reward Function)
- World Models
- Relational inductive biases, deep learning, and graph networks
- Loss Surfaces of Multilayer Networks
- Visualizing the Loss Landscape of Neural Nets
- The Matrix Calculus You Need for Deep Learning
- Group Normalization
- Layer Normalization
- Artificial Intelligence Meets Natural Stupidity
- Qualitatively characterizing neural network optimization problems
- Strong Inference
- A learning algorithm for continually running fully recurrent neural networks
- Adaptive multi-level hyper-gradient descent
- Rotate your networks: better weight consolidation and less catastrophic forgetting
- Attention is not *all* you need
- When BERT plays the lottery, all tickets are winning

This is a follow-on to "A meta-transfer objective for learning to disentangle causal mechanisms".
Here we describe an algorithm for predicting the causal graph structure of a set of visible random variables, each possibly causally dependent on any of the other variables.
### the algorithm

There are two sets of parameters: the structural parameters and the functional parameters. The structural parameters form a matrix where \(\sigma(\gamma_{ij})\) represents the belief that variable \(X_j\) is a direct cause of \(X_i\).
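As a minimal sketch (the names, shapes, and sampling scheme here are my own illustration, not the paper's code), the structural parameters can be held as a matrix \(\gamma\) whose elementwise sigmoid gives the edge beliefs, from which candidate causal graphs can be sampled:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
n = 3  # number of visible random variables

# Hypothetical structural parameters: sigmoid(gamma[i, j]) is the
# current belief that X_j is a direct cause of X_i.
gamma = rng.normal(size=(n, n))

beliefs = sigmoid(gamma)
np.fill_diagonal(beliefs, 0.0)  # a variable is not its own cause

# Sample a candidate causal graph as an adjacency matrix;
# entry (i, j) = 1 hypothesizes the edge X_j -> X_i.
adjacency = (rng.random((n, n)) < beliefs).astype(int)
```

Gradients on \(\gamma\) would then come from how well the functional parameters fit the data under the sampled graphs.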

Read more
Theoretically, models should be able to predict on out-of-distribution data if their understanding of causal relationships is correct. The toy problem they use in this paper is that of predicting temperature from altitude. If a model is trained on data from Switzerland, the model should ideally be able to correctly predict on data from the Netherlands, even though it hasn’t seen elevations that low before.
The main contribution of this paper is the finding that models tend to transfer faster to a new distribution when they have learned the correct causal relationships, and when those relationships are sparsely represented, meaning they are represented by relatively few nodes in the network.

Read more
The theoretical value in talking about the parameter-function map is that this map lets us talk about sets of parameters that produce the same function. In this paper they use recent results from algorithmic information theory (AIT) to show that for neural networks the parameter-function map is biased toward functions with low Kolmogorov complexity, meaning that simple functions are more likely to appear given a random choice of parameters. Since real-world problems are also biased toward simple functions, this could explain the generalization/memorization results found by Zhang et al.
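A toy way to see this bias (my own sketch, not the paper's experiment): sample random parameters for a tiny network on boolean inputs and count how often each resulting truth table appears. The distribution over the \(2^8\) possible functions is far from uniform, with a few functions taking most of the mass:

```python
import numpy as np
from collections import Counter
from itertools import product

rng = np.random.default_rng(0)
inputs = np.array(list(product([0.0, 1.0], repeat=3)))  # all 8 boolean inputs

def sample_function(rng, hidden=5):
    # Random parameters -> a boolean function, recorded as its truth table.
    W1 = rng.normal(size=(3, hidden))
    b1 = rng.normal(size=hidden)
    w2 = rng.normal(size=hidden)
    out = (np.tanh(inputs @ W1 + b1) @ w2 > 0).astype(int)
    return tuple(out)

counts = Counter(sample_function(rng) for _ in range(20000))
# Heavy-tailed: the most frequent truth tables (typically simple,
# e.g. near-constant ones) appear far more often than 20000/256.
top = counts.most_common(5)
```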

Read more
This paper builds on what we learned in "Understanding deep learning requires rethinking generalization". In that paper they showed that DNNs can fit pure noise in the same amount of time as they can fit real data, which means that the optimization algorithm (SGD, Adam, etc.) is not what keeps DNNs from overfitting.
### experiments for detecting easy/hard samples

It looks like there are qualitative differences between a DNN that has memorized its training data and a DNN that has learned from real data.

Read more
The goal of hyperparameter tuning is to reach the point where the test loss curve flattens out on the graph of loss versus model complexity.
Underfitting shows up when the learning rate is too small, the architecture is too simple, or the data distribution is too complex. You can watch underfitting decrease as a loss curve that drops steeply at the outset and then flattens further into training. You can use the LR range test to find a good learning rate range, and then use a cyclical learning rate to move up and down within that range.
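A minimal sketch of the triangular cyclical schedule (the `lr_min`/`lr_max` values below are placeholders you would pick from an LR range test, not recommended settings):

```python
def cyclical_lr(step, lr_min=1e-4, lr_max=1e-1, step_size=2000):
    """Triangular cyclical learning rate (Smith, 2017).

    Rises linearly from lr_min to lr_max over step_size steps,
    then falls back to lr_min, repeating every 2 * step_size steps.
    """
    cycle = 1 + step // (2 * step_size)
    x = abs(step / step_size - 2 * cycle + 1)
    return lr_min + (lr_max - lr_min) * max(0.0, 1 - x)
```

In a training loop you would set the optimizer's learning rate to `cyclical_lr(step)` at each step.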

Read more