LAION-5B, a major artificial intelligence dataset commonly used to train text-to-image generators such as Stable Diffusion and Imagen, has been taken down by its creator, the German nonprofit Large-scale Artificial Intelligence Open Network (LAION). The removal follows a study revealing thousands of instances of suspected child sexual abuse material (CSAM) within the dataset.
Researchers at the Stanford Internet Observatory's Cyber Policy Center identified 3,226 instances of suspected CSAM in LAION-5B, a portion of which were confirmed by third parties. While the study notes that this material may not drastically alter the output of models trained on the dataset, it could still exert some influence, particularly because identical instances are repeated, reinforcing depictions of specific victims.
LAION, known for creating open-source AI models and datasets, including LAION-5B with its 5.85 billion image-text pairs released in March 2022, issued a statement expressing its commitment to caution.
LAION has removed both the LAION-5B and LAION-400M datasets to verify their safety before considering republication. The move reflects a proactive effort to address the ethical concerns raised by the inadvertent inclusion of abusive content in AI training data.
