Security Parrot - Cyber Security News, Insights and Reviews
‘ChatGPT based on illegal sites, private data and piracy’

Last updated: 2023/06/08 at 5:35 PM
Security Parrot Editorial Team Published June 8, 2023

Is ChatGPT a Chatbot of Privacy Violation and Piracy?

Recent research by De Groene Amsterdammer indicates that the largest single source in the Dutch-language data used to train language models is a pirate site that has been found illegal. The research also states that OpenAI’s filter for checking the quality of online content does not work well enough for Dutch-language content.

Multilingual Datasets

Common Crawl is a well-known public archive of web scrapes that covers a large part of the internet. mC4 is a multilingual dataset that Google built from it, and it was much harder to assemble than C4, its English-language counterpart. According to the Google researchers, a single Common Crawl scrape from April 2019 was enough for C4, whereas mC4 required aggregating 71 monthly web scrapes.
Google demonstrated the usefulness of the dataset with mT5, its multilingual Natural Language Processing (NLP) language model. All code and training sets are publicly available. The researchers explain this choice as follows: “We are releasing all code and pre-trained datasets used in this paper to simplify future work on multilingualism research.”

Pirate Site Leader

It is likely that this dataset also forms the basis for GPT-3 and therefore ChatGPT. Multilingual datasets are difficult to compile, so there are few of them. De Groene Amsterdammer worked from that theory and concluded that the mC4 dataset is most likely behind the OpenAI language model. It also examined which Dutch-language websites feed the training sets. The top twenty contains some surprising results, to say the least.
The largest source for the Dutch-language part of mC4 is the controversial Dutch pirate site Docplayer, which accounts for 3.6 percent of that data. The website is a haven for hackers, because private information such as applicant evaluation documents is freely available there. It constantly scours the internet for files and also contains data from data breaches, complete resumes and tax returns. It was not long before the Dutch Data Protection Authority and the National Cyber Security Center found the website illegal. Nevertheless, it is still up and running.
The top three is completed by tripadvisor.nl (1.9%) and uitspraken.rechtspraak.nl (1.2%), the site that publishes Dutch court rulings. Classified ads from private sellers are also well represented: ebay.nl ranks eleventh with 0.3 percent, and marktplaats.nl has a share of 0.2 percent. As a result, the language model has knowledge of many telephone numbers from advertisements on these websites.
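The ranking above boils down to counting how often each website appears in the dataset. The following is only a minimal sketch of that idea, not the method De Groene Amsterdammer used: it assumes each record carries a `url` field, measures share by number of documents rather than by amount of text, and runs on a handful of made-up sample records. The function name `domain_shares` and the sample URLs are hypothetical.

```python
from collections import Counter
from urllib.parse import urlparse


def domain_shares(records):
    """Return (domain, percent of documents) pairs, largest share first."""
    counts = Counter()
    for record in records:
        host = urlparse(record["url"]).netloc.lower()
        # Treat "www.example.nl" and "example.nl" as the same source.
        if host.startswith("www."):
            host = host[4:]
        counts[host] += 1
    total = sum(counts.values())
    if total == 0:
        return []
    return [(domain, 100.0 * n / total) for domain, n in counts.most_common()]


# Hypothetical sample records for illustration, not real mC4 data.
sample = [
    {"url": "https://docplayer.nl/123-voorbeeld.html"},
    {"url": "https://www.tripadvisor.nl/Hotel_Review-voorbeeld"},
    {"url": "https://docplayer.nl/456-voorbeeld.html"},
    {"url": "https://www.marktplaats.nl/a/voorbeeld"},
]

for domain, share in domain_shares(sample):
    print(f"{domain}: {share:.1f}%")
```

On real data the same counting would have to run over millions of records, and weighting by text length instead of document count could change the ranking.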

ChatGPT: A Chatbot of Privacy Violation and Piracy?

ChatGPT speaks Dutch quite well. The language model must have learned the language from Dutch data that is freely available on the internet. Companies usually keep the composition of their training sets secret. It is not publicly known, for example, exactly which data went into GPT-3, the model behind ChatGPT.

OpenAI, the company that created ChatGPT, is a research laboratory based in San Francisco.

Conclusion

ChatGPT is a chatbot that speaks Dutch and is powered by GPT-3, a natural language processing (NLP) model. OpenAI keeps the composition of GPT-3's training sets secret, but recent research indicates that the largest single source in the Dutch-language training data is a pirate site that has been found illegal. This raises the question: is ChatGPT a chatbot built on privacy violations and piracy?
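The research also states that OpenAI’s filter for checking the quality of online content does not work well enough for Dutch-language content.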
