Got the pointer to this from Allison Parrish who says it better than I could:
it’s a very compelling paper, with a super clever methodology, and (i’m paraphrasing/extrapolating) shows that “alignment” strategies like RLHF only work to ensure that it never seems like a white person is saying something overtly racist, rather than addressing the actual prejudice baked into the model.
I don’t think anyone is surprised, but brace yourself for the next round of OpenAI and peers claiming to fix this issue.