
Note that Llama's feed-forward block is a bit different too:

  self.w2(F.silu(self.w1(x)) * self.w3(x))
I.e. the nonlinearity is a gate: silu(w1(x)) multiplies the linear branch w3(x) elementwise before w2 projects back down.

https://github.com/meta-llama/llama3/blob/14aab0428d3ec3a959...



Fwiw, that's SwiGLU in #3 above. Swi = Swish = SiLU. GLU is a gated linear unit, i.e. the gate construction you describe.
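
To make the gating concrete, here is a minimal framework-free sketch of that expression. The helper names (silu, matvec, swiglu_ffn) and the list-of-rows weight layout are my own assumptions for illustration, not taken from the Llama code, and the linear layers are assumed to have no bias.

```python
import math


def silu(z):
    """SiLU (a.k.a. Swish with beta = 1): z * sigmoid(z)."""
    return z / (1.0 + math.exp(-z))


def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def swiglu_ffn(x, W1, W2, W3):
    """Compute w2(silu(w1(x)) * w3(x)) with bias-free linear layers."""
    gate = [silu(h) for h in matvec(W1, x)]         # nonlinear gate branch
    value = matvec(W3, x)                           # plain linear branch
    hidden = [g * v for g, v in zip(gate, value)]   # elementwise gating
    return matvec(W2, hidden)                       # project back down


# Toy weights (hypothetical): model dim 2, hidden dim 2, output dim 1.
W1 = [[1.0, 0.0], [0.0, 1.0]]
W3 = [[1.0, 0.0], [0.0, 1.0]]
W2 = [[1.0, 1.0]]
out = swiglu_ffn([1.0, 0.0], W1, W2, W3)
```

With these toy weights the second hidden unit is shut off entirely (silu(0) = 0), which is the "gate" part; a classic non-gated FFN would instead compute w2(act(w1(x))) with no w3 branch.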



