How to make an adversary resistant neural network

Bram Cohen
3 min read · Jan 31, 2021


I previously pontificated about the difficulties of making a neural network robust in the adversarial model and a general approach to overcoming them. In this post I’ll flesh out the specifics of the construction a bit more.

What we want to do is make a neural network whose overall structure naturally gives it adversary resistance, so we can simply train it as well as we can and it automatically gets that feature, rather than having to do something special in the training, which doesn’t seem to work.

There are two tricks here. First, the number of outputs is about the same as the number of inputs, and the final value is the sum of the outputs. This neatly fixes the adversarial control problem but raises the question of how to keep the outputs from all getting trained the same way. The second trick fixes that by adding a middle layer in which every value is calculated from a different subset of the inputs, and every output is calculated from a different subset of the middle layer. Every output still depends on every input, but the paths they take are so radically different that the outputs can’t possibly end up all that correlated.

Now for the fun math part of this construction. First find the smallest prime P whose square P² is at least the number of inputs, and arrange the inputs randomly into a square of length P on a side. (A few ‘input’ values will have to be ‘missing’ because of the numerical mismatch, but that’s a small number and doesn’t matter.) We define a ‘line’ through the inputs as all the positions (A+Z*B, C+Z*D) (mod P), varying Z and holding A, B, C and D fixed, with B and D not both zero. Each middle layer value is the output of a neural network using the values falling on one line as its inputs. Since there are P²+P distinct lines, the number of middle layer values is slightly more than the number of original inputs.
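The grid and its lines can be enumerated directly. Here is a minimal sketch in Python, assuming a slope/intercept parameterization (P lines for each of the P slopes, plus P vertical lines) as one way to list each distinct line exactly once. The function names are made up for illustration and are not from any library.

```python
import random

def is_prime(n):
    """Trial-division primality test; fine for the small P used here."""
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True

def smallest_prime_side(num_inputs):
    """Smallest prime P with P*P >= num_inputs."""
    p = 2
    while p * p < num_inputs or not is_prime(p):
        p += 1
    return p

def affine_lines(p):
    """Every distinct line of the p x p grid (mod p), as a frozenset of (x, y) cells."""
    lines = []
    # Non-vertical lines: y = slope * x + intercept (mod p).
    for slope in range(p):
        for intercept in range(p):
            lines.append(frozenset((z, (slope * z + intercept) % p) for z in range(p)))
    # Vertical lines: x held fixed.
    for x in range(p):
        lines.append(frozenset((x, z) for z in range(p)))
    return lines

def place_inputs(num_inputs, p, seed=0):
    """Randomly assign input indices to grid cells; unassigned cells are the 'missing' inputs."""
    cells = [(x, y) for x in range(p) for y in range(p)]
    random.Random(seed).shuffle(cells)
    return {cell: idx for idx, cell in enumerate(cells[:num_inputs])}
```

For example, `smallest_prime_side(784)` returns 29 (a 28 by 28 image needs a 29 by 29 grid, leaving 57 cells empty), and `len(affine_lines(5))` is 30, which matches P²+P for P = 5.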

Each output is then the result of a neural network which uses all the middle layer values whose lines pass through some fixed point (E, F). There are P+1 such lines, and between them they cover every position in the square. This results in all but one of the input values getting into the final output via exactly one path, with the slightly odd mathematical ‘wart’ that the input at (E, F) itself lies on every one of those lines and so gets included many times, which seems to be necessary for the construction. With one output per point there are P² total outputs, which again is slightly more than the number of inputs. (You can also add one output for each of the P+1 directions, grouping together all parallel lines at the same angle, making a total of P²+P+1 outputs.)
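The ‘exactly one path, except for one value’ property is easy to check by brute force. The sketch below counts how many of the lines through a chosen point cover each cell. The helper names are hypothetical, and the line enumeration uses the same slope/intercept assumption as a way of listing each line once.

```python
from collections import Counter

def affine_lines(p):
    """Every distinct line of the p x p grid (mod p), as a frozenset of (x, y) cells."""
    lines = [frozenset((z, (m * z + c) % p) for z in range(p))
             for m in range(p) for c in range(p)]
    lines += [frozenset((x, z) for z in range(p)) for x in range(p)]
    return lines

def lines_through(point, lines):
    """The lines (middle layer values) that feed the output for this point."""
    return [line for line in lines if point in line]

def coverage(point, lines):
    """How many times each cell appears among the lines through `point`."""
    counts = Counter()
    for line in lines_through(point, lines):
        counts.update(line)
    return counts

if __name__ == "__main__":
    p = 7
    counts = coverage((2, 3), affine_lines(p))
    # The chosen point sits on all P+1 lines; every other cell sits on exactly one.
    print(counts[(2, 3)], set(v for cell, v in counts.items() if cell != (2, 3)))
```

With P = 7 the chosen point is counted 8 times (once per line through it) and each of the other 48 cells exactly once.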

[Figure: An example of the inputs for the final output corresponding to the black square. Each middle layer line is a different color.]

My suspicion is that doing the above verbatim won’t produce something that can recognize things well, no matter how well you train it, because the inputs and outputs are simply too disjointed. That can probably be fixed by making each middle layer value actually several values. If that number is too small it won’t be able to do recognition, and if it’s too big then everything in the middle layer contains too much of the inputs and all the outputs will wind up correlated again, so it’s necessary to tune the number to hit the sweet spot.
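To make the wiring concrete, here is a rough forward-pass sketch with K values per line. Everything about the subnetworks is an assumption for illustration: each one is a single layer of tanh units with random, untrained weights, missing inputs would simply be zeros in the grid, and `k` is the tunable number of values per middle layer line. It shows the shape of the construction, not a tested recognizer.

```python
import math
import random

def affine_lines(p):
    """Every distinct line of the p x p grid (mod p), as a list of (x, y) cells."""
    lines = [[(z, (m * z + c) % p) for z in range(p)] for m in range(p) for c in range(p)]
    lines += [[(x, z) for z in range(p)] for x in range(p)]
    return lines

def tiny_net(weights, inputs):
    """One tanh unit per weight row; a stand-in for a real trained subnetwork."""
    return [math.tanh(sum(w * v for w, v in zip(row, inputs))) for row in weights]

def init_params(p, k, seed=0):
    """Random weights: k units for each of the P*P+P lines, one unit for each of the P*P outputs."""
    rng = random.Random(seed)
    mid_w = [[[rng.gauss(0, 1 / math.sqrt(p)) for _ in range(p)] for _ in range(k)]
             for _ in range(p * p + p)]
    fan_in = (p + 1) * k  # P+1 lines through each point, k values per line
    out_w = [[rng.gauss(0, 1 / math.sqrt(fan_in)) for _ in range(fan_in)]
             for _ in range(p * p)]
    return mid_w, out_w

def forward(grid, mid_w, out_w, p):
    """Return the P*P outputs and their sum (the final value)."""
    lines = affine_lines(p)
    # Middle layer: k values per line, computed only from the inputs on that line.
    mid_vals = [tiny_net(mid_w[i], [grid[x][y] for (x, y) in line])
                for i, line in enumerate(lines)]
    outputs = []
    for e in range(p):
        for f in range(p):
            # Gather the middle layer values of every line passing through (e, f).
            feats = [v for i, line in enumerate(lines) if (e, f) in line
                     for v in mid_vals[i]]
            outputs.append(tiny_net([out_w[e * p + f]], feats)[0])
    return outputs, sum(outputs)
```

Raising `k` widens each line’s channel into every output that uses it, which is the knob the paragraph above suggests tuning.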

One interesting aspect of interpreting the outputs of this sort of thing is that there’s likely to be a wide range of values above ‘overwhelming evidence in the non-adversarial model’ but below ‘convincing evidence in the adversarial model’. I think this is very much a real thing and corresponds to what we mean when we say our own personal neural network ‘imagined’ something. Having an output which is a simple probability only works in the non-adversarial model. As soon as you get into the adversarial world you have to start making assumptions about how strong your adversary might be.
