Jump in the discussion.

No email address required.

>trained on stories containing child abuse

What, like game of thrones?

The Quran :derpsnickering:

aisha being nine is from a hadith that's rated probably authentic. it's still awful though.

:#marseyakshually:

These URLs would then be sent to PhotoDNA to detect if any known CSAM was present, and matches would be sent to the Project Arachnid Shield API to have the results validated by Canadian Centre for Child Protection (C3P). Once instances of CSAM were verified, we would use their image embeddings to run k‐nearest neighbors (KNN) queries to find related images in the dataset.

I'm not entirely sure how k-nearest neighbors queries work.... What exactly makes another link similar enough? :marseyhmm:

After initial validation, we also implemented industry MD5 hash sets provided by the National Center for Missing and Exploited Children (NCMEC) and a CSAM classifier provided by Thorn to discover additional content not detectable via PhotoDNA.

They used government-funded file servers full of CP!!!

:marseyschizowall:

They run k-nearest neighbours on the image embeddings, not the links. Image embeddings already cluster by similarity in the embedding space when they're created, so the nearest neighbours are visually similar. Visual similarity can mean a lot of things, though. It's probably useful for finding other instances of the same image with different compression artifacts, but I don't know if CSAM would be most visually similar to other CSAM rather than to legal pictures with similar color distributions and visual components.
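
To make the "what counts as similar enough" part concrete, here is a minimal sketch of a KNN query over embeddings. It assumes cosine similarity and made-up 4-dimensional vectors; the report quoted above doesn't say which metric, embedding model, or index was used, so the function names and data here are purely hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def knn(query, embeddings, k):
    """Return the indices of the k embeddings most similar to the query."""
    scored = [(cosine(query, e), i) for i, e in enumerate(embeddings)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [i for _, i in scored[:k]]

# Toy "embeddings" (hypothetical); real ones have hundreds of dimensions.
embeddings = [
    [1.0, 0.0, 0.0, 0.0],
    [0.9, 0.1, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.8, 0.0, 0.2, 0.0],
]
query = [1.0, 0.05, 0.0, 0.0]

nearest = knn(query, embeddings, k=3)
print(nearest)
```

There is no built-in threshold for "similar enough": the query always returns the k closest vectors, however far away they are, so whoever runs it still has to check the neighbours by some other means.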

Thank you, my neighbour! :marseypeace:

>Just sanitize your 6B pictures dataset

Ayy lemme just get my 99.99999999% accurate cp remover, that I'm supposed to have trained somehow.

The research by the Stanford Internet Observatory showed it's possible, although inefficient. Besides, scraping the entire internet (which is how the dataset was created) is likely to pull in many more images that you don't want for other reasons, such as watermarked, blurry, or mislabeled ones, which could have been avoided if any care had been taken when creating the dataset. This has happened many times before. One example is Latitude's dataset used to train AI Dungeon (which was only 29 MB): it contained similar material in text form, yet Latitude claimed the dataset was clean and that players were at fault.

>Stanford Internet Observatory showed it's possible

Confirmed pedos.

>although inefficient

how much? Prohibitively so? So much that the West will lose its lead to China, who won't care about this stuff anyway?

China has their own set of no-no's they have to filter out, and Chinese people are way more clever than westerners at throwing shade on their government.

Even theirs is pretty shitty. Probably took 120 hours to get it right, and I'd be surprised if they shared their source code.

:marseyreading:

this is of course compounded by the presence of dozens of languages in the dataset, many of which may use native language terms or slang that translate poorly. As an example, even a commonly used term such as “loli” in Japanese (ロリ) was frequently translated as the name “Lori”, or occasionally the word “LOL”.

i.e., actual CSAM entries in the dataset may have generic‐sounding labels while explicit material depicting adults may commonly have ambiguous indicators of youth (teen, schoolgirl, twink, etc). The text descriptions for the majority of initial PhotoDNA hits used generic captions that could apply to either legal or illegal material; therefore we conclude that at least for English language material, text descriptions are of limited utility for identifying CSAM.

!anime bros, the lolis remain undetected. :marseysweating: They're also onto "twinks" but "femboy" remains safe. :marseylgbtflag: :marseyfemboy:

LAION datasets do not include the actual images; instead, they include a link to the original image on the site from which it was scraped.

:marseybeanrelieved:

Thank goodness, it's only links to child porn...

Given that multiple years have elapsed between the time the content was scraped and processed, a large percentage of the URLs passed to PhotoDNA (≈30%) were reported as no longer being active. They may, however, have been used to train models before they were removed from their original URLs, and some likely continue to reside in versions of the datasets retrieved at earlier dates.

:#marseyveryworriedfed:

Officer, officer, I only used the dataset to generate mature catgirls. I didn't know it was trained on loli catgirls! :marseysweating:

They found about 200 links to CSAM, out of the gigatons of data it was trained on. :marseyshrug:

Using the CSAM classifier provided by Thorn on the remaining neighbors, 575 results were strongly predicted to be CSAM (99% or higher probability). These were submitted to PhotoDNA for scanning, resulting in 18 matches.

It's amusing how their method with a 99%+ probability still has a false positive rate of 97%. :marseyoperasmug:
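
For anyone checking the arithmetic, a quick sketch using only the two numbers from the quoted passage (575 classifier hits, 18 PhotoDNA matches). The variable names are mine, and treating every non-match as a false positive is an assumption: the report says the classifier was used to find content not detectable via PhotoDNA, so a non-match is not proof the classifier was wrong.

```python
# Figures taken from the quoted report passage.
classifier_hits = 575    # results scored 99%+ by the classifier
photodna_matches = 18    # of those, confirmed as known material by PhotoDNA

match_rate = photodna_matches / classifier_hits
unmatched_rate = 1 - match_rate

print(f"matched by PhotoDNA: {match_rate:.1%}")
print(f"not matched:         {unmatched_rate:.1%}")
```

So roughly 97% of the classifier's hits went unconfirmed by PhotoDNA, which is an upper bound on the share that were wrong rather than a measured false positive rate.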

I wonder if the FBI uses similar cowtools and happens to find "CP" everywhere. :marseyhmm:

I wonder if the FBI uses similar cowtools and happens to find "CP" everywhere.

Why would they bother? The FBI has literal tons of actual CP. A small USB-drive inserted into a device of an undesirable once they confiscate them is much easier for them. :marseyshrug:

The FBI doesn't need to find anything, when they can just put a Windows95 laptop full of it in the house of someone who speaks up against a Federal Agent shooting 200 people at a concert

Neighbor there's literally 5.8 BILLION images in that dataset. Less than 0.00001% are cp

The bar is zero.

<200 of 5.8 billion.

But that's after using LAION's previous method for removing CP, I think.
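
A rough check of that proportion, assuming the thread's round numbers (200 as an upper bound on hits, 5.8 billion images); neither figure is exact, and the variable names are just for illustration.

```python
hits_upper_bound = 200          # "about 200" / "<200" per the thread
dataset_size = 5_800_000_000    # 5.8 billion images per the thread

fraction = hits_upper_bound / dataset_size
percent = fraction * 100

print(f"{percent:.7f}%")
print(percent < 0.00001)  # is it under the 0.00001% claimed upthread?
```

That works out to about 3.4 per 100 million images, so the "less than 0.00001%" figure upthread holds for these numbers.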

Of course they're German.

That was part of the joke in the title, as well as including the logo in my post (It looks like something furries would make)

A company produces paintings outside. Out of every few million paintings, a bird shits on one.

What is more reasonable? Hiring a birdshit detector to manually inspect every painting one by one, or quickly killing the pigeons in the park that shit on the paintings?

i remember playing a wizard in ai dungeon and it just making me wake up as a child about to be abused. that shit was fricked.

https://i.rdrama.net/images/17037293014647088.webp

I'm not sure why you agree to be the face of an article with a title like that lol


Don't forget to turn off signatures in settings!

>german company produces p-do shit

this should not surprise you. every german is a p-dophile. they have sent orphans to live with p-dophiles. they are subhuman trash. each and every germoid is complicit in state sanctioned pedophillia and should be shot.

|Death to germoids|


Give me your money and I'll annoy people with it :space: https://i.rdrama.net/images/16965516366194396.webp

It's actually very difficult to sanitise these datasets and this is a complicated subj... just kidding, immediate:#marseywoodchipper2: for all involved


https://i.rdrama.net/images/172187329668082.webp

:#marseyrcaconnectingfast:

um what did he mean by this??

unplug AI
