Is ChatGPT a Chatbot of Privacy Violation and Piracy?

Recent research has revealed that the Dutch datasets for training language models are largely fed by a pirate site that has been found illegal. The research also states that OpenAI’s filter to check online content for quality does not work sufficiently for Dutch-language content.

Multilingual Datasets

Common Crawl is a well-known web archive that aims to capture more or less the entire public internet. mC4 is a multilingual dataset derived from it. Google created mC4, and it was considerably harder to assemble than an English-language dataset. According to Google researchers, a single web scrape from April 2019 was enough for C4, the English-language dataset. mC4 required aggregating 71 monthly web scrapes from Common Crawl.
Google demonstrated the usefulness of the dataset with mT5, its Natural Language Processing (NLP) language model. All code and training sets are publicly available. The researchers explain this choice as follows: “We are releasing all code and pre-trained datasets used in this paper to simplify future work on multilingualism research.”

Pirate Site Tops the List

It is likely that this dataset also forms the basis for GPT-3 and therefore ChatGPT. Multilingual datasets are difficult to compile, so few of them exist. De Groene Amsterdammer investigated that theory and concluded that the mC4 dataset is most likely behind the OpenAI language model. It also examined which Dutch-language websites form the basis for the training sets. The top twenty contains some surprising results, to say the least.
The largest source for the mC4 dataset is the controversial Dutch pirate site Docplayer. It accounts for 3.6 percent of the total data set. The website is a haven for hackers, because private information such as applicant evaluation documents is freely available there. The website constantly scours the internet for files and also contains data from data breaches, complete resumes and tax returns. It was not long before the Dutch Data Protection Authority and the National Cyber Security Center declared the website illegal. Nevertheless, it is still up and running.
The rest of the top three consists of tripadvisor.nl (1.9%) and uitspraken.rechtspraak.nl (1.2%). Advertisements from private sellers are also well represented: 0.3 percent comes from ebay.nl, which ranks eleventh, and marktplaats.nl has a share of 0.2 percent. As a result, the language model has seen many of the telephone numbers in advertisements on these websites.
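A ranking like the one above can be approximated by tallying how often each hostname appears among the source URLs of a dataset. The following is a minimal sketch, not the method used in the research: it assumes that a share is measured as a percentage of documents (the article does not say whether documents, words or bytes were counted), and the function name domain_shares and the sample URLs are hypothetical.

```python
from collections import Counter
from urllib.parse import urlparse


def domain_shares(urls):
    """Return (hostname, percent of documents) pairs, largest share first."""
    hosts = []
    for url in urls:
        host = urlparse(url).hostname or ""
        # Treat "www.example.nl" and "example.nl" as the same site.
        if host.startswith("www."):
            host = host[4:]
        hosts.append(host)
    counts = Counter(hosts)
    total = len(hosts)
    return [(host, 100.0 * n / total) for host, n in counts.most_common()]


# Hypothetical sample; a real run would read the URL of every record
# in the Dutch portion of the dataset.
sample_urls = [
    "https://www.tripadvisor.nl/a",
    "https://www.tripadvisor.nl/b",
    "https://www.marktplaats.nl/c",
    "https://www.ebay.nl/d",
]

for host, pct in domain_shares(sample_urls):
    print(f"{host}: {pct:.1f}%")
```

Counting by document treats a one-line advertisement the same as a long court ruling, so a tally weighted by text length could produce a different ranking.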

ChatGPT: A Chatbot of Privacy Violation and Piracy?

ChatGPT speaks quite a bit of Dutch. The language model must have taught itself our language, using Dutch data that is freely available on the internet. Companies usually keep the composition of their training sets secret. It is not known, for example, how GPT-3, the model behind ChatGPT, came about.
ChatGPT is a chatbot created by OpenAI, a research laboratory based in San Francisco. It can understand and respond to conversations in Dutch and is powered by GPT-3, a natural language processing (NLP) language model whose training-set composition is kept secret.

Conclusion

ChatGPT is a chatbot that speaks Dutch and is powered by GPT-3, a natural language processing (NLP) language model. The composition of the training sets used to create GPT-3 is kept secret, but recent research has revealed that the Dutch datasets for training language models are largely fed by a pirate site that has been found illegal. This raises the question: Is ChatGPT a chatbot of privacy violation and piracy?
The research also states that