{"id":7353,"date":"2025-04-02T12:04:30","date_gmt":"2025-04-02T12:04:30","guid":{"rendered":"https:\/\/favtutor.com\/articles\/?p=7353"},"modified":"2025-04-12T07:19:29","modified_gmt":"2025-04-12T07:19:29","slug":"chatgpt-copyright-concern-oreilly-books","status":"publish","type":"post","link":"https:\/\/favtutor.com\/articles\/chatgpt-copyright-concern-oreilly-books\/","title":{"rendered":"ChatGPT may have Stolen Content from O\u2019Reilly Books"},"content":{"rendered":"\n<p>Generative AI tools seem to know all about our world, but how? They were trained on public data from the Internet. But were they also trained on private content that sits behind paywalls? New research suggests they were.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\"><strong>New Evidence: ChatGPT was trained on O&#8217;Reilly Books<\/strong><\/h2>\n\n\n\n<p>Researchers set out to uncover whether non-public content from O&#8217;Reilly Media books was quietly included in OpenAI&#8217;s training data. They used DE-COP, a membership inference attack introduced in 2024 that is designed to detect copyrighted content in training data. The results were shocking!<\/p>\n\n\n\n<p><strong>OpenAI&#8217;s GPT-4o model scored 82% AUROC on non-public, paywalled content.<\/strong> That is much higher than the 50% expected from random guessing, and it suggests that GPT-4o was trained on premium content from O&#8217;Reilly.<\/p>\n\n\n\n<p>However, GPT-3.5 Turbo (the older ChatGPT model) showed no comparable pattern of recognizing the copyrighted content. <\/p>\n\n\n\n<p>O&#8217;Reilly Media isn&#8217;t a small publisher. It is known for many popular technical books, and it offers both public preview content and high-quality paywalled books. 
Founded in 1978, the company has an estimated annual revenue of up to $500 million.<\/p>\n\n\n\n<p>While the study focuses on O\u2019Reilly Media books and ChatGPT models, it raises concerns that similar practices could be widespread across the AI industry, potentially harming the broader ecosystem of digital content.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\"><strong>What did the new Research Paper find?<\/strong><\/h2>\n\n\n\n<p>The <a href=\"https:\/\/ssrc-static.s3.us-east-1.amazonaws.com\/OpenAI-Training-Violations-OReillyBooks_Sruly-OReilly-Strauss_SSRC_04012025.pdf\" target=\"_blank\" rel=\"noreferrer noopener nofollow\">study<\/a> aims to determine whether non-public content from O\u2019Reilly books was included in the training data of OpenAI\u2019s models, particularly by comparing older models (GPT-3.5 Turbo) with more recent ones (GPT-4o and GPT-4o Mini). Tim O\u2019Reilly was also part of this research.<\/p>\n\n\n\n<p>AI models like ChatGPT need vast amounts of data to learn language patterns, context, and reasoning. Training on diverse sources also improves adaptability, making AI useful for conversations, coding, and creative writing. <\/p>\n\n\n\n<p>O\u2019Reilly books typically have two sections: publicly accessible preview content and non-public content behind a paywall. The preview consists of the first 1,500 characters of each chapter, plus the entirety of chapters one and four. This split allows researchers to check whether models recognize content they shouldn\u2019t have seen during training.<\/p>\n\n\n\n<p>The researchers used a legally obtained dataset of 34 copyrighted O\u2019Reilly Media books. 
They split the content into:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Public text:<\/strong> Excerpts (e.g., the first 1,500 characters of chapters) made freely available.<\/li>\n\n\n\n<li><strong>Non-public text:<\/strong> The remainder of the text, which is paywalled.<\/li>\n<\/ul>\n\n\n\n<p>They employed a method where the model is given a multiple-choice quiz. For each paragraph from the books, the model has to identify which option is the original human-authored text among paraphrased alternatives generated by another model (Claude 3.5 Sonnet).<\/p>\n\n\n\n<p>By comparing the model\u2019s performance on texts published before the training cutoff (potentially seen) versus texts published after (definitely unseen), they calculated AUROC scores. An AUROC of 50% corresponds to random chance, while a score closer to 100% suggests strong recognition (i.e., prior exposure in training).<\/p>\n\n\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-full\"><img decoding=\"async\" width=\"683\" height=\"419\" src=\"https:\/\/favtutor.com\/articles\/wp-content\/uploads\/2025\/04\/Capture.jpg\" alt=\"AUROC Scores for ChatGPT on O'Reilly Books\" class=\"wp-image-7355\"\/><\/figure>\n<\/div>\n\n\n<p>Note that the training cutoff for GPT-4o and GPT-4o mini is October 2023, while it is September 2021 for GPT-3.5 Turbo.<\/p>\n\n\n\n<p>GPT-4o achieved an AUROC score of about 82% on the non-public O\u2019Reilly book content, indicating that it recognizes this paywalled material far better than random chance. Its score on the public content was just 64%.<\/p>\n\n\n\n<p>Here&#8217;s the conclusion of the study: &#8220;<em>GPT-4o\u2019s high familiarity with O\u2019Reilly Media books likely reflects a deliberate effort by OpenAI to train on the O\u2019Reilly book dataset.<\/em>&#8221;<\/p>\n\n\n\n<h2 class=\"wp-block-heading\"><strong>Takeaways<\/strong><\/h2>\n\n\n\n<p>While one experiment is not the final truth, it does stir up questions about how AI models are trained and where their data comes from. 
The bigger question here is one of transparency. Greater transparency about training data would help content creators receive fair compensation. While <a href=\"https:\/\/favtutor.com\/articles\/openai-deals-content-websites\/\">OpenAI is partnering with many content websites<\/a> to source content ethically, its past practices remain a big concern.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Generative AI tools seem to know all about our world, but how? They were trained on public data from the Internet. But were they also trained on private content that sits behind paywalls? New research suggests they were. New Evidence: ChatGPT was trained on O&#8217;Reilly Books Researchers set out to uncover whether [&hellip;]<\/p>\n","protected":false},"author":33,"featured_media":7354,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"jnews-multi-image_gallery":[],"jnews_single_post":{"format":"standard"},"jnews_primary_category":[],"footnotes":""},"categories":[57],"tags":[56,61,60],"class_list":["post-7353","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-ai","tag-ai","tag-chatgpt","tag-openai"],"_links":{"self":[{"href":"https:\/\/favtutor.com\/articles\/wp-json\/wp\/v2\/posts\/7353","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/favtutor.com\/articles\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/favtutor.com\/articles\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/favtutor.com\/articles\/wp-json\/wp\/v2\/users\/33"}],"replies":[{"embeddable":true,"href":"https:\/\/favtutor.com\/articles\/wp-json\/wp\/v2\/comments?post=7353"}],"version-history":[{"count":1,"href":"https:\/\/favtutor.com\/articles\/wp-json\/wp\/v2\/posts\/7353\/revisions"}],"predecessor-version":[{"id":7356,"href":"https:\/\/favtutor.com\/articles\/wp-json\/wp\/v2\/posts\/7353\/revisions\/7356"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/favtu
tor.com\/articles\/wp-json\/wp\/v2\/media\/7354"}],"wp:attachment":[{"href":"https:\/\/favtutor.com\/articles\/wp-json\/wp\/v2\/media?parent=7353"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/favtutor.com\/articles\/wp-json\/wp\/v2\/categories?post=7353"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/favtutor.com\/articles\/wp-json\/wp\/v2\/tags?post=7353"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}