Buckets:
| <meta charset="utf-8" /><meta name="hf:doc:metadata" content="{"title":"데이터 처리","local":"processing-the-data","sections":[{"title":"Hub에서 데이터 세트 가져오기","local":"loading-a-dataset-from-the-hub","sections":[],"depth":3},{"title":"데이터 세트 전처리","local":"preprocessing-a-dataset","sections":[{"title":"동적 패딩","local":"dynamic-padding","sections":[],"depth":5}],"depth":3},{"title":"섹션 퀴즈","local":"section-quiz","sections":[{"title":"1. batched=True 와 함께 Dataset.map() 을 사용하는 주요 장점은 무엇인가요?","local":"1-batchedtrue-와-함께-datasetmap-을-사용하는-주요-장점은-무엇인가요","sections":[],"depth":3},{"title":"2. 데이터 세트의 최대 길이로 모든 시퀀스를 패딩하는 대신 동적 패딩을 사용하는 이유는 무엇인가요?","local":"2-데이터-세트의-최대-길이로-모든-시퀀스를-패딩하는-대신-동적-패딩을-사용하는-이유는-무엇인가요","sections":[],"depth":3},{"title":"3. BERT 토큰화에서 token_type_ids 필드는 무엇을 나타내나요?","local":"3-bert-토큰화에서-tokentypeids-필드는-무엇을-나타내나요","sections":[],"depth":3},{"title":"4. load_dataset('glue', 'mrpc') 로 데이터 세트를 로딩할 때 두 번째 인수는 무엇을 지정하나요?","local":"4-loaddatasetglue-mrpc-로-데이터-세트를-로딩할-때-두-번째-인수는-무엇을-지정하나요","sections":[],"depth":3},{"title":"5. 훈련 전에 ‘sentence1’과 ‘sentence2’ 같은 열을 제거하는 목적은 무엇인가요?","local":"5-훈련-전에-sentence1과-sentence2-같은-열을-제거하는-목적은-무엇인가요","sections":[],"depth":3}],"depth":2}],"depth":1}"> | |
| <link href="/docs/course/pr_1069/ko/_app/immutable/assets/0.e3b0c442.css" rel="modulepreload"> | |
| <link rel="modulepreload" href="/docs/course/pr_1069/ko/_app/immutable/entry/start.667ba29b.js"> | |
| <link rel="modulepreload" href="/docs/course/pr_1069/ko/_app/immutable/chunks/scheduler.37c15a92.js"> | |
| <link rel="modulepreload" href="/docs/course/pr_1069/ko/_app/immutable/chunks/singletons.8de3d8e2.js"> | |
| <link rel="modulepreload" href="/docs/course/pr_1069/ko/_app/immutable/chunks/index.18351ede.js"> | |
| <link rel="modulepreload" href="/docs/course/pr_1069/ko/_app/immutable/chunks/paths.6047fff5.js"> | |
| <link rel="modulepreload" href="/docs/course/pr_1069/ko/_app/immutable/entry/app.65e35200.js"> | |
| <link rel="modulepreload" href="/docs/course/pr_1069/ko/_app/immutable/chunks/index.2bf4358c.js"> | |
| <link rel="modulepreload" href="/docs/course/pr_1069/ko/_app/immutable/nodes/0.9169bc8c.js"> | |
| <link rel="modulepreload" href="/docs/course/pr_1069/ko/_app/immutable/chunks/each.e59479a4.js"> | |
| <link rel="modulepreload" href="/docs/course/pr_1069/ko/_app/immutable/nodes/21.3c80e2e5.js"> | |
| <link rel="modulepreload" href="/docs/course/pr_1069/ko/_app/immutable/chunks/Tip.363c041f.js"> | |
| <link rel="modulepreload" href="/docs/course/pr_1069/ko/_app/immutable/chunks/Youtube.1e50a667.js"> | |
| <link rel="modulepreload" href="/docs/course/pr_1069/ko/_app/immutable/chunks/CodeBlock.4e987730.js"> | |
| <link rel="modulepreload" href="/docs/course/pr_1069/ko/_app/immutable/chunks/CourseFloatingBanner.9ff4c771.js"> | |
| <link rel="modulepreload" href="/docs/course/pr_1069/ko/_app/immutable/chunks/DocNotebookDropdown.efc1fb7c.js"> | |
| <link rel="modulepreload" href="/docs/course/pr_1069/ko/_app/immutable/chunks/Question.668688bc.js"> | |
| <link rel="modulepreload" href="/docs/course/pr_1069/ko/_app/immutable/chunks/getInferenceSnippets.24b50994.js"><!-- HEAD_svelte-u9bgzb_START --><meta name="hf:doc:metadata" content="{"title":"데이터 처리","local":"processing-the-data","sections":[{"title":"Hub에서 데이터 세트 가져오기","local":"loading-a-dataset-from-the-hub","sections":[],"depth":3},{"title":"데이터 세트 전처리","local":"preprocessing-a-dataset","sections":[{"title":"동적 패딩","local":"dynamic-padding","sections":[],"depth":5}],"depth":3},{"title":"섹션 퀴즈","local":"section-quiz","sections":[{"title":"1. batched=True 와 함께 Dataset.map() 을 사용하는 주요 장점은 무엇인가요?","local":"1-batchedtrue-와-함께-datasetmap-을-사용하는-주요-장점은-무엇인가요","sections":[],"depth":3},{"title":"2. 데이터 세트의 최대 길이로 모든 시퀀스를 패딩하는 대신 동적 패딩을 사용하는 이유는 무엇인가요?","local":"2-데이터-세트의-최대-길이로-모든-시퀀스를-패딩하는-대신-동적-패딩을-사용하는-이유는-무엇인가요","sections":[],"depth":3},{"title":"3. BERT 토큰화에서 token_type_ids 필드는 무엇을 나타내나요?","local":"3-bert-토큰화에서-tokentypeids-필드는-무엇을-나타내나요","sections":[],"depth":3},{"title":"4. load_dataset('glue', 'mrpc') 로 데이터 세트를 로딩할 때 두 번째 인수는 무엇을 지정하나요?","local":"4-loaddatasetglue-mrpc-로-데이터-세트를-로딩할-때-두-번째-인수는-무엇을-지정하나요","sections":[],"depth":3},{"title":"5. 훈련 전에 ‘sentence1’과 ‘sentence2’ 같은 열을 제거하는 목적은 무엇인가요?","local":"5-훈련-전에-sentence1과-sentence2-같은-열을-제거하는-목적은-무엇인가요","sections":[],"depth":3}],"depth":2}],"depth":1}"><!-- HEAD_svelte-u9bgzb_END --> <p></p> <h1 class="relative group"><a id="processing-the-data" class="header-link block pr-1.5 text-lg no-hover:hidden with-hover:absolute with-hover:p-1.5 with-hover:opacity-0 with-hover:group-hover:opacity-100 with-hover:right-full" href="#processing-the-data"><span><svg class="" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" aria-hidden="true" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 256 256"><path d="M167.594 88.393a8.001 8.001 0 0 1 0 11.314l-67.882 67.882a8 8 0 1 1-11.314-11.315l67.882-67.881a8.003 8.003 0 0 1 11.314 0zm-28.287 84.86l-28.284 28.284a40 40 0 0 1-56.567-56.567l28.284-28.284a8 8 0 0 0-11.315-11.315l-28.284 28.284a56 56 0 0 0 79.196 79.197l28.285-28.285a8 8 0 1 0-11.315-11.314zM212.852 43.14a56.002 56.002 0 0 0-79.196 0l-28.284 28.284a8 8 0 1 0 11.314 11.314l28.284-28.284a40 40 0 0 1 56.568 56.567l-28.285 28.285a8 8 0 0 0 11.315 11.314l28.284-28.284a56.065 56.065 0 0 0 0-79.196z" fill="currentColor"></path></svg></span></a> <span>데이터 처리</span></h1> <div class="flex space-x-1 absolute z-10 right-0 top-0"><a href="https://discuss.huggingface.co/t/chapter-3-questions" target="_blank"><img alt="Ask a Question" class="!m-0" src="https://img.shields.io/badge/Ask%20a%20question-ffcb4c.svg?logo=data:image/svg+xml;base64,PHN2ZyB4bWxucz0iaHR0cDovL3d3dy53My5vcmcvMjAwMC9zdmciIHZpZXdCb3g9IjAgLTEgMTA0IDEwNiI+PGRlZnM+PHN0eWxlPi5jbHMtMXtmaWxsOiMyMzFmMjA7fS5jbHMtMntmaWxsOiNmZmY5YWU7fS5jbHMtM3tmaWxsOiMwMGFlZWY7fS5jbHMtNHtmaWxsOiMwMGE5NGY7fS5jbHMtNXtmaWxsOiNmMTVkMjI7fS5jbHMtNntmaWxsOiNlMzFiMjM7fTwvc3R5bGU+PC9kZWZzPjx0aXRsZT5EaXNjb3Vyc2VfbG9nbzwvdGl0bGU+PGcgaWQ9IkxheWVyXzIiPjxnIGlkPSJMYXllcl8zIj48cGF0aCBjbGFzcz0iY2xzLTEiIGQ9Ik01MS44NywwQzIzLjcxLDAsMCwyMi44MywwLDUxYzAsLjkxLDAsNTIuODEsMCw1Mi44MWw1MS44Ni0uMDVjMjguMTYsMCw1MS0yMy43MSw1MS01MS44N1M4MCwwLDUxLjg3LDBaIi8+PHBhdGggY2xhc3M9ImNscy0yIiBkPSJNNTIuMzcsMTkuNzRBMzEuNjIsMzEuNjIsMCwwLDAsMjQuNTgsNjYuNDFsLTUuNzIsMTguNEwzOS40LDgwLjE3YTMxLjYxLDMxLjYxLDAsMSwwLDEzLTYwLjQzWiIvPjxwYXRoIGNsYXNzPSJjbHMtMyIgZD0iTTc3LjQ1LDMyLjEyYTMxLjYsMzEuNiwwLDAsMS0zOC4wNSw0OEwxOC44Niw4NC44MmwyMC45MS0yLjQ3QTMxLjYsMzEuNiwwLDAsMCw3Ny40NSwzMi4xMloiLz48cGF0aCBjbGFzcz0iY2xzLTQiIGQ9Ik03MS42MywyNi4yOUEzMS42LDMxLjYsMCwwLDEsMzguOCw3OEwxOC44Niw4NC44MiwzOS40LDgwLjE3QTMxLjYsMzEuNiwwLDAsMCw3MS42MywyNi4yOVoiLz48cGF0aCBjbGFzcz0iY2xzLTUiIGQ9Ik0yNi40Nyw2Ny4xMWEzMS42MSwzMS42MSwwLDAsMSw1MS0zNUEzMS42MSwzMS42MSwwLDAsMCwyNC41OCw2Ni40MWwtNS43MiwxOC40WiIvPjxwYXRoIGNsYXNzPSJjbHMtNiIgZD0iTTI0LjU4LDY2LjQxQTMxLjYxLDMxLjYxLDAsMCwxLDcxLjYzLDI2LjI5YTMxLjYxLDMxLjYxLDAsMCwwLTQ5LDM5LjYzbC0zLjc2LDE4LjlaIi8+PC9nPjwvZz48L3N2Zz4="></a> <a href="https://colab.research.google.com/github/huggingface/notebooks/blob/master/course/en/chapter3/section2.ipynb" target="_blank"><img alt="Open In Colab" class="!m-0" src="https://colab.research.google.com/assets/colab-badge.svg"></a> <a href="https://studiolab.sagemaker.aws/import/github/huggingface/notebooks/blob/master/course/en/chapter3/section2.ipynb" target="_blank"><img alt="Open In Studio Lab" class="!m-0" src="https://studiolab.sagemaker.aws/studiolab.svg"></a></div> <p data-svelte-h="svelte-18n4e1w"><a href="/course/chapter2">이전 챕터</a>의 예제에 이어서, 한 배치에서 시퀀스 분류기를 훈련하는 방법은 다음과 같습니다.</p> <div class="code-block relative "><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START --><span class="hljs-keyword">import</span> torch | |
| <span class="hljs-keyword">from</span> torch.optim <span class="hljs-keyword">import</span> AdamW | |
| <span class="hljs-keyword">from</span> transformers <span class="hljs-keyword">import</span> AutoTokenizer, AutoModelForSequenceClassification | |
| <span class="hljs-comment"># 이전과 동일</span> | |
| checkpoint = <span class="hljs-string">"bert-base-uncased"</span> | |
| tokenizer = AutoTokenizer.from_pretrained(checkpoint) | |
| model = AutoModelForSequenceClassification.from_pretrained(checkpoint) | |
| sequences = [ | |
| <span class="hljs-string">"I've been waiting for a HuggingFace course my whole life."</span>, | |
| <span class="hljs-string">"This course is amazing!"</span>, | |
| ] | |
| batch = tokenizer(sequences, padding=<span class="hljs-literal">True</span>, truncation=<span class="hljs-literal">True</span>, return_tensors=<span class="hljs-string">"pt"</span>) | |
| <span class="hljs-comment"># 여기가 새로운 부분</span> | |
| batch[<span class="hljs-string">"labels"</span>] = torch.tensor([<span class="hljs-number">1</span>, <span class="hljs-number">1</span>]) | |
| optimizer = AdamW(model.parameters()) | |
| loss = model(**batch).loss | |
| loss.backward() | |
| optimizer.step()<!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-1yvacnx">물론 두 문장만으로 모델을 훈련하는 것으로는 매우 좋은 결과를 얻을 수 없습니다. 더 나은 결과를 얻으려면 더 큰 데이터 세트를 준비해야 합니다.</p> <p data-svelte-h="svelte-6tmng3">이 섹션에서는 William B. Dolan과 Chris Brockett의 <a href="https://www.aclweb.org/anthology/I05-5002.pdf" rel="nofollow">논문</a>에서 소개된 MRPC(Microsoft Research Paraphrase Corpus) 데이터 세트를 예제로 사용하겠습니다. 이 데이터 세트는 5,801개의 문장 쌍으로 구성되어 있으며, 각 쌍이 패러프레이즈인지 아닌지를 나타내는 레이블이 있습니다(즉, 두 문장이 같은 의미인지). 이 챕터에서 이 데이터 세트를 선택한 이유는 작은 데이터 세트이므로 훈련 실험을 하기에 쉽기 때문입니다.</p> <h3 class="relative group"><a id="loading-a-dataset-from-the-hub" class="header-link block pr-1.5 text-lg no-hover:hidden with-hover:absolute with-hover:p-1.5 with-hover:opacity-0 with-hover:group-hover:opacity-100 with-hover:right-full" href="#loading-a-dataset-from-the-hub"><span><svg class="" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" aria-hidden="true" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 256 256"><path d="M167.594 88.393a8.001 8.001 0 0 1 0 11.314l-67.882 67.882a8 8 0 1 1-11.314-11.315l67.882-67.881a8.003 8.003 0 0 1 11.314 0zm-28.287 84.86l-28.284 28.284a40 40 0 0 1-56.567-56.567l28.284-28.284a8 8 0 0 0-11.315-11.315l-28.284 28.284a56 56 0 0 0 79.196 79.197l28.285-28.285a8 8 0 1 0-11.315-11.314zM212.852 43.14a56.002 56.002 0 0 0-79.196 0l-28.284 28.284a8 8 0 1 0 11.314 11.314l28.284-28.284a40 40 0 0 1 56.568 56.567l-28.285 28.285a8 8 0 0 0 11.315 11.314l28.284-28.284a56.065 56.065 0 0 0 0-79.196z" fill="currentColor"></path></svg></span></a> <span>Hub에서 데이터 세트 가져오기</span></h3> <iframe class="w-full xl:w-4/6 h-80" src="https://www.youtube-nocookie.com/embed/_BZearw7f0w" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe> <p data-svelte-h="svelte-1quhi9e">Hub에는 모델뿐만 아니라 다양한 언어로 된 여러 데이터 세트도 있습니다. <a href="https://huggingface.co/datasets" rel="nofollow">여기</a>에서 데이터 세트를 찾아볼 수 있으며, 이 섹션을 완료한 후에는 새로운 데이터 세트를 로드하고 처리해보는 것을 권장합니다(<a href="https://huggingface.co/docs/datasets/loading" rel="nofollow">여기</a>에서 일반적인 문서를 참조하세요). 하지만 지금은 MRPC 데이터 세트에 집중해보겠습니다! 이것은 <a href="https://gluebenchmark.com/" rel="nofollow">GLUE 벤치마크</a>를 구성하는 10개 데이터 세트 중 하나로, 10개의 서로 다른 텍스트 분류 작업에 걸쳐 ML 모델의 성능을 측정하는 데 사용되는 학술적 벤치마크입니다.</p> <p data-svelte-h="svelte-1b27c1z">🤗 Datasets 라이브러리는 Hub에서 데이터 세트를 다운로드하고 캐시하는 매우 간단한 명령을 제공합니다. MRPC 데이터 세트를 다음과 같이 다운로드할 수 있습니다.</p> <div class="course-tip bg-gradient-to-br dark:bg-gradient-to-r before:border-green-500 dark:before:border-green-800 from-green-50 dark:from-gray-900 to-white dark:to-gray-950 border border-green-50 text-green-700 dark:text-gray-400"><p data-svelte-h="svelte-qls13x">💡 <strong>추가 자료</strong>: 더 많은 데이터 세트 로딩 기법과 예제를 보려면 <a href="https://huggingface.co/docs/datasets/" rel="nofollow">🤗 Datasets 문서</a>를 확인하세요.</p></div> <div class="code-block relative "><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START --><span class="hljs-keyword">from</span> datasets <span class="hljs-keyword">import</span> load_dataset | |
| raw_datasets = load_dataset(<span class="hljs-string">"glue"</span>, <span class="hljs-string">"mrpc"</span>) | |
| raw_datasets<!-- HTML_TAG_END --></pre></div> <div class="code-block relative "><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START -->DatasetDict({ | |
| train: Dataset({ | |
| features: [<span class="hljs-string">'sentence1'</span>, <span class="hljs-string">'sentence2'</span>, <span class="hljs-string">'label'</span>, <span class="hljs-string">'idx'</span>], | |
| num_rows: <span class="hljs-number">3668</span> | |
| }) | |
| validation: Dataset({ | |
| features: [<span class="hljs-string">'sentence1'</span>, <span class="hljs-string">'sentence2'</span>, <span class="hljs-string">'label'</span>, <span class="hljs-string">'idx'</span>], | |
| num_rows: <span class="hljs-number">408</span> | |
| }) | |
| test: Dataset({ | |
| features: [<span class="hljs-string">'sentence1'</span>, <span class="hljs-string">'sentence2'</span>, <span class="hljs-string">'label'</span>, <span class="hljs-string">'idx'</span>], | |
| num_rows: <span class="hljs-number">1725</span> | |
| }) | |
| })<!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-1f0x3aj">보시다시피, 훈련 세트, 검증 세트, 테스트 세트가 포함된 <code>DatasetDict</code> 객체를 얻습니다. 각각은 여러 열(<code>sentence1</code>, <code>sentence2</code>, <code>label</code>, <code>idx</code>)과 가변적인 행 수를 포함하며, 이는 각 세트의 요소 수입니다(따라서 훈련 세트에는 3,668개의 문장 쌍이, 검증 세트에는 408개가, 테스트 세트에는 1,725개가 있습니다).</p> <div class="course-tip bg-gradient-to-br dark:bg-gradient-to-r before:border-green-500 dark:before:border-green-800 from-green-50 dark:from-gray-900 to-white dark:to-gray-950 border border-green-50 text-green-700 dark:text-gray-400"><p data-svelte-h="svelte-1t8miz5">이 명령은 기본적으로 <em>~/.cache/huggingface/datasets</em>에 데이터 세트를 다운로드하고 캐시합니다. 2장에서 언급했듯이 <code>HF_HOME</code> 환경 변수를 설정하여 캐시 폴더를 맞춤 설정할 수 있습니다.</p></div> <p data-svelte-h="svelte-nbm9uo">딕셔너리처럼 인덱싱하여 <code>raw_datasets</code> 객체의 각 문장 쌍에 접근할 수 있습니다.</p> <div class="code-block relative "><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START -->raw_train_dataset = raw_datasets[<span class="hljs-string">"train"</span>] | |
| raw_train_dataset[<span class="hljs-number">0</span>]<!-- HTML_TAG_END --></pre></div> <div class="code-block relative "><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START -->{<span class="hljs-string">'idx'</span>: <span class="hljs-number">0</span>, | |
| <span class="hljs-string">'label'</span>: <span class="hljs-number">1</span>, | |
| <span class="hljs-string">'sentence1'</span>: <span class="hljs-string">'Amrozi accused his brother , whom he called " the witness " , of deliberately distorting his evidence .'</span>, | |
| <span class="hljs-string">'sentence2'</span>: <span class="hljs-string">'Referring to him as only " the witness " , Amrozi accused his brother of deliberately distorting his evidence .'</span>}<!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-11hwp3u">레이블이 이미 정수로 되어 있으므로 여기서 전처리를 할 필요가 없습니다. 어떤 정수가 어떤 레이블에 해당하는지 알아보려면 <code>raw_train_dataset</code>의 <code>features</code>를 검사하면 됩니다. 이것은 각 열의 타입을 알려줍니다.</p> <div class="code-block relative "><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START -->raw_train_dataset.features<!-- HTML_TAG_END --></pre></div> <div class="code-block relative "><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START -->{<span class="hljs-string">'sentence1'</span>: Value(dtype=<span class="hljs-string">'string'</span>, <span class="hljs-built_in">id</span>=<span class="hljs-literal">None</span>), | |
| <span class="hljs-string">'sentence2'</span>: Value(dtype=<span class="hljs-string">'string'</span>, <span class="hljs-built_in">id</span>=<span class="hljs-literal">None</span>), | |
| <span class="hljs-string">'label'</span>: ClassLabel(num_classes=<span class="hljs-number">2</span>, names=[<span class="hljs-string">'not_equivalent'</span>, <span class="hljs-string">'equivalent'</span>], names_file=<span class="hljs-literal">None</span>, <span class="hljs-built_in">id</span>=<span class="hljs-literal">None</span>), | |
| <span class="hljs-string">'idx'</span>: Value(dtype=<span class="hljs-string">'int32'</span>, <span class="hljs-built_in">id</span>=<span class="hljs-literal">None</span>)}<!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-1i632h4">내부적으로 <code>label</code>은 <code>ClassLabel</code> 타입이며, 정수와 레이블 이름의 매핑이 <em>names</em> 폴더에 저장되어 있습니다. <code>0</code>은 <code>not_equivalent</code>에, <code>1</code>은 <code>equivalent</code>에 해당합니다.</p> <div class="course-tip bg-gradient-to-br dark:bg-gradient-to-r before:border-green-500 dark:before:border-green-800 from-green-50 dark:from-gray-900 to-white dark:to-gray-950 border border-green-50 text-green-700 dark:text-gray-400"><p data-svelte-h="svelte-hik0zm">✏️ <strong>직접 해보기!</strong> 훈련 세트의 15번째 요소와 검증 세트의 87번째 요소를 살펴보세요. 그들의 레이블은 무엇인가요?</p></div> <h3 class="relative group"><a id="preprocessing-a-dataset" class="header-link block pr-1.5 text-lg no-hover:hidden with-hover:absolute with-hover:p-1.5 with-hover:opacity-0 with-hover:group-hover:opacity-100 with-hover:right-full" href="#preprocessing-a-dataset"><span><svg class="" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" aria-hidden="true" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 256 256"><path d="M167.594 88.393a8.001 8.001 0 0 1 0 11.314l-67.882 67.882a8 8 0 1 1-11.314-11.315l67.882-67.881a8.003 8.003 0 0 1 11.314 0zm-28.287 84.86l-28.284 28.284a40 40 0 0 1-56.567-56.567l28.284-28.284a8 8 0 0 0-11.315-11.315l-28.284 28.284a56 56 0 0 0 79.196 79.197l28.285-28.285a8 8 0 1 0-11.315-11.314zM212.852 43.14a56.002 56.002 0 0 0-79.196 0l-28.284 28.284a8 8 0 1 0 11.314 11.314l28.284-28.284a40 40 0 0 1 56.568 56.567l-28.285 28.285a8 8 0 0 0 11.315 11.314l28.284-28.284a56.065 56.065 0 0 0 0-79.196z" fill="currentColor"></path></svg></span></a> <span>데이터 세트 전처리</span></h3> <iframe class="w-full xl:w-4/6 h-80" src="https://www.youtube-nocookie.com/embed/0u3ioSwev3s" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe> <p data-svelte-h="svelte-72w04v">데이터 세트를 전처리하려면 텍스트를 모델이 이해할 수 있는 숫자로 변환해야 합니다. <a href="/course/chapter2">이전 챕터</a>에서 보았듯이, 이는 토크나이저로 수행됩니다. 토크나이저에 한 문장이나 문장 목록을 입력할 수 있으므로, 다음과 같이 각 쌍의 모든 첫 번째 문장과 모든 두 번째 문장을 직접 토큰화할 수 있습니다.</p> <div class="code-block relative "><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START --><span class="hljs-keyword">from</span> transformers <span class="hljs-keyword">import</span> AutoTokenizer | |
| checkpoint = <span class="hljs-string">"bert-base-uncased"</span> | |
| tokenizer = AutoTokenizer.from_pretrained(checkpoint) | |
| tokenized_sentences_1 = tokenizer(raw_datasets[<span class="hljs-string">"train"</span>][<span class="hljs-string">"sentence1"</span>]) | |
| tokenized_sentences_2 = tokenizer(raw_datasets[<span class="hljs-string">"train"</span>][<span class="hljs-string">"sentence2"</span>])<!-- HTML_TAG_END --></pre></div> <div class="course-tip bg-gradient-to-br dark:bg-gradient-to-r before:border-green-500 dark:before:border-green-800 from-green-50 dark:from-gray-900 to-white dark:to-gray-950 border border-green-50 text-green-700 dark:text-gray-400"><p data-svelte-h="svelte-wel7bo">💡 <strong>심화 학습</strong>: 더 고급 토큰화 기법과 다양한 토크나이저가 작동하는 방식을 이해하려면 <a href="https://huggingface.co/docs/transformers/main/en/tokenizer_summary" rel="nofollow">🤗 Tokenizers 문서</a>와 <a href="https://huggingface.co/learn/cookbook/en/advanced_rag#tokenization-strategies" rel="nofollow">쿡북의 토큰화 가이드</a>를 살펴보세요.</p></div> <p data-svelte-h="svelte-1gliuzv">하지만 두 시퀀스를 모델에 전달하기만 해서는 두 문장이 패러프레이즈인지 아닌지 예측할 수 없습니다. 두 시퀀스를 쌍으로 처리하고 적절한 전처리를 적용해야 합니다. 다행히 토크나이저는 한 쌍의 시퀀스를 받아서 BERT 모델이 기대하는 방식으로 준비할 수도 있습니다.</p> <div class="code-block relative "><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START -->inputs = tokenizer(<span class="hljs-string">"This is the first sentence."</span>, <span class="hljs-string">"This is the second one."</span>) | |
| inputs<!-- HTML_TAG_END --></pre></div> <div class="code-block relative "><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START -->{ | |
| <span class="hljs-string">'input_ids'</span>: [<span class="hljs-number">101</span>, <span class="hljs-number">2023</span>, <span class="hljs-number">2003</span>, <span class="hljs-number">1996</span>, <span class="hljs-number">2034</span>, <span class="hljs-number">6251</span>, <span class="hljs-number">1012</span>, <span class="hljs-number">102</span>, <span class="hljs-number">2023</span>, <span class="hljs-number">2003</span>, <span class="hljs-number">1996</span>, <span class="hljs-number">2117</span>, <span class="hljs-number">2028</span>, <span class="hljs-number">1012</span>, <span class="hljs-number">102</span>], | |
| <span class="hljs-string">'token_type_ids'</span>: [<span class="hljs-number">0</span>, <span class="hljs-number">0</span>, <span class="hljs-number">0</span>, <span class="hljs-number">0</span>, <span class="hljs-number">0</span>, <span class="hljs-number">0</span>, <span class="hljs-number">0</span>, <span class="hljs-number">0</span>, <span class="hljs-number">1</span>, <span class="hljs-number">1</span>, <span class="hljs-number">1</span>, <span class="hljs-number">1</span>, <span class="hljs-number">1</span>, <span class="hljs-number">1</span>, <span class="hljs-number">1</span>], | |
| <span class="hljs-string">'attention_mask'</span>: [<span class="hljs-number">1</span>, <span class="hljs-number">1</span>, <span class="hljs-number">1</span>, <span class="hljs-number">1</span>, <span class="hljs-number">1</span>, <span class="hljs-number">1</span>, <span class="hljs-number">1</span>, <span class="hljs-number">1</span>, <span class="hljs-number">1</span>, <span class="hljs-number">1</span>, <span class="hljs-number">1</span>, <span class="hljs-number">1</span>, <span class="hljs-number">1</span>, <span class="hljs-number">1</span>, <span class="hljs-number">1</span>] | |
| }<!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-1dayjrj"><a href="/course/chapter2">2장</a>에서 <code>input_ids</code>와 <code>attention_mask</code> 키에 대해 논의했지만, <code>token_type_ids</code>에 대한 이야기는 미뤄두었습니다. 이 예제에서 이것은 입력의 어느 부분이 첫 번째 문장이고 어느 부분이 두 번째 문장인지 모델에 알려주는 역할을 합니다.</p> <div class="course-tip bg-gradient-to-br dark:bg-gradient-to-r before:border-green-500 dark:before:border-green-800 from-green-50 dark:from-gray-900 to-white dark:to-gray-950 border border-green-50 text-green-700 dark:text-gray-400"><p data-svelte-h="svelte-xfqr76">✏️ <strong>직접 해보기!</strong> 훈련 세트의 15번째 요소를 가져와서 두 문장을 따로따로 토큰화하고 쌍으로도 토큰화해보세요. 두 결과의 차이점은 무엇인가요?</p></div> <p data-svelte-h="svelte-fp8ycy"><code>input_ids</code> 안의 ID를 다시 단어로 디코딩하면</p> <div class="code-block relative "><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START -->tokenizer.convert_ids_to_tokens(inputs[<span class="hljs-string">"input_ids"</span>])<!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-1a5aa9u">다음을 얻습니다.</p> <div class="code-block relative "><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START -->[<span class="hljs-string">'[CLS]'</span>, <span class="hljs-string">'this'</span>, <span class="hljs-string">'is'</span>, <span class="hljs-string">'the'</span>, <span class="hljs-string">'first'</span>, <span class="hljs-string">'sentence'</span>, <span class="hljs-string">'.'</span>, <span class="hljs-string">'[SEP]'</span>, <span class="hljs-string">'this'</span>, <span class="hljs-string">'is'</span>, <span class="hljs-string">'the'</span>, <span class="hljs-string">'second'</span>, <span class="hljs-string">'one'</span>, <span class="hljs-string">'.'</span>, <span class="hljs-string">'[SEP]'</span>]<!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-tqglnc">따라서 모델은 두 문장이 있을 때 입력이 <code>[CLS] sentence1 [SEP] sentence2 [SEP]</code> 형태이기를 기대한다는 것을 알 수 있습니다. 이를 <code>token_type_ids</code>와 맞춰보면</p> <div class="code-block relative "><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START -->[<span class="hljs-string">'[CLS]'</span>, <span class="hljs-string">'this'</span>, <span class="hljs-string">'is'</span>, <span class="hljs-string">'the'</span>, <span class="hljs-string">'first'</span>, <span class="hljs-string">'sentence'</span>, <span class="hljs-string">'.'</span>, <span class="hljs-string">'[SEP]'</span>, <span class="hljs-string">'this'</span>, <span class="hljs-string">'is'</span>, <span class="hljs-string">'the'</span>, <span class="hljs-string">'second'</span>, <span class="hljs-string">'one'</span>, <span class="hljs-string">'.'</span>, <span class="hljs-string">'[SEP]'</span>] | |
| [ <span class="hljs-number">0</span>, <span class="hljs-number">0</span>, <span class="hljs-number">0</span>, <span class="hljs-number">0</span>, <span class="hljs-number">0</span>, <span class="hljs-number">0</span>, <span class="hljs-number">0</span>, <span class="hljs-number">0</span>, <span class="hljs-number">1</span>, <span class="hljs-number">1</span>, <span class="hljs-number">1</span>, <span class="hljs-number">1</span>, <span class="hljs-number">1</span>, <span class="hljs-number">1</span>, <span class="hljs-number">1</span>]<!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-13qt3k3">보시다시피, <code>[CLS] sentence1 [SEP]</code>에 해당하는 입력 부분은 모두 토큰 타입 ID가 <code>0</code>이고, <code>sentence2 [SEP]</code>에 해당하는 다른 부분들은 모두 토큰 타입 ID가 <code>1</code>입니다.</p> <p data-svelte-h="svelte-1x3j9xz">다른 체크포인트(checkpoint)를 선택하면 토큰화된 입력에 <code>token_type_ids</code>가 반드시 있지는 않다는 점에 주의하세요(예를 들어, DistilBERT 모델을 사용하면 반환되지 않습니다). 모델이 사전 훈련 중에 이를 본 적이 있어서 무엇을 해야 할지 알 때만 반환됩니다.</p> <p data-svelte-h="svelte-16t7d1w">여기서 BERT는 토큰 타입 ID로 사전 훈련되었으며, <a href="/course/chapter1">1장</a>에서 이야기한 마스크드 언어 모델링 목표 외에도 <em>다음 문장 예측</em>이라는 추가 목표를 가지고 있습니다. 이 작업의 목표는 문장 쌍 간의 관계를 모델링하는 것입니다.</p> <p data-svelte-h="svelte-mmugyu">다음 문장 예측에서는 모델에 문장 쌍(무작위로 마스킹된 토큰과 함께)이 제공되고 두 번째 문장이 첫 번째 문장을 따르는지 예측하도록 요청받습니다. 작업을 쉽지 않게 만들기 위해, 절반의 경우에는 문장들이 추출된 원본 문서에서 서로를 따르고, 나머지 절반의 경우에는 두 문장이 서로 다른 문서에서 나옵니다.</p> <p data-svelte-h="svelte-9ytk6a">일반적으로 토큰화된 입력에 <code>token_type_ids</code>가 있는지 여부에 대해 걱정할 필요는 없습니다. 토크나이저와 모델에 동일한 체크포인트(checkpoint)를 사용하는 한, 토크나이저가 모델에 제공해야 할 것을 알고 있으므로 모든 것이 잘 될 것입니다.</p> <p data-svelte-h="svelte-1lt9uol">이제 토크나이저가 한 쌍의 문장을 어떻게 처리할 수 있는지 보았으므로, 이를 사용하여 전체 데이터 세트를 토큰화할 수 있습니다: <a href="/course/chapter2">이전 챕터</a>에서처럼, 첫 번째 문장 목록과 두 번째 문장 목록을 제공하여 토크나이저에 문장 쌍 목록을 입력할 수 있습니다. 이는 <a href="/course/chapter2">2장</a>에서 본 패딩과 생략 옵션과도 호환됩니다. 따라서 훈련 데이터 세트를 전처리하는 한 가지 방법은</p> <div class="code-block relative "><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START -->tokenized_dataset = tokenizer( | |
| raw_datasets[<span class="hljs-string">"train"</span>][<span class="hljs-string">"sentence1"</span>], | |
| raw_datasets[<span class="hljs-string">"train"</span>][<span class="hljs-string">"sentence2"</span>], | |
| padding=<span class="hljs-literal">True</span>, | |
| truncation=<span class="hljs-literal">True</span>, | |
| )<!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-1qzyd20">이것은 잘 작동하지만, 딕셔너리(키는 <code>input_ids</code>, <code>attention_mask</code>, <code>token_type_ids</code>이고 값은 목록의 목록)를 반환한다는 단점이 있습니다. 또한 토큰화 중에 전체 데이터 세트를 메모리에 저장할 수 있는 충분한 RAM이 있는 경우에만 작동합니다(🤗 Datasets 라이브러리의 데이터 세트는 디스크에 저장된 <a href="https://arrow.apache.org/" rel="nofollow">Apache Arrow</a> 파일이므로, 요청한 샘플만 메모리에 로드됩니다).</p> <p data-svelte-h="svelte-rdvnbx">데이터를 데이터 세트로 유지하려면 <a href="https://huggingface.co/docs/datasets/package_reference/main_classes#datasets.Dataset.map" rel="nofollow"><code>Dataset.map()</code></a> 메소드를 사용하겠습니다. 이는 토큰화 이상의 전처리가 필요한 경우 추가적인 유연성도 제공합니다. <code>map()</code> 메소드는 데이터 세트의 각 요소에 함수를 적용하여 작동하므로, 입력을 토큰화하는 함수를 정의해보겠습니다.</p> <div class="code-block relative "><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START --><span class="hljs-keyword">def</span> <span class="hljs-title function_">tokenize_function</span>(<span class="hljs-params">example</span>): | |
| <span class="hljs-keyword">return</span> tokenizer(example[<span class="hljs-string">"sentence1"</span>], example[<span class="hljs-string">"sentence2"</span>], truncation=<span class="hljs-literal">True</span>)<!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-f19j5a">이 함수는 딕셔너리(데이터 세트의 항목과 같은)를 받아서 <code>input_ids</code>, <code>attention_mask</code>, <code>token_type_ids</code> 키가 있는 새 딕셔너리를 반환합니다. <code>example</code> 딕셔너리에 여러 샘플이 포함되어 있어도(각 키가 문장 목록으로) 작동한다는 점에 주목하세요. 앞서 본 것처럼 <code>tokenizer</code>는 문장 쌍의 목록에서 작동하기 때문입니다. 이를 통해 <code>map()</code> 호출에서 <code>batched=True</code> 옵션을 사용할 수 있으며, 이는 토큰화를 크게 가속화할 것입니다. <code>tokenizer</code>는 <a href="https://github.com/huggingface/tokenizers" rel="nofollow">🤗 Tokenizers</a> 라이브러리의 Rust로 작성된 토크나이저로 뒷받침됩니다. 이 토크나이저는 매우 빠를 수 있지만, 한 번에 많은 입력을 제공해야만 그렇습니다.</p> <p data-svelte-h="svelte-1if9us4">지금은 토큰화 함수에서 <code>padding</code> 인수를 빼둔 것에 주목하세요. 모든 샘플을 최대 길이로 패딩하는 것은 효율적이지 않기 때문입니다. 배치를 만들 때 샘플을 패딩하는 것이 더 좋습니다. 그러면 해당 배치의 최대 길이까지만 패딩하면 되고, 전체 데이터 세트의 최대 길이까지 패딩할 필요가 없기 때문입니다. 입력의 길이가 매우 다양할 때 많은 시간과 처리 능력을 절약할 수 있습니다!</p> <div class="course-tip bg-gradient-to-br dark:bg-gradient-to-r before:border-green-500 dark:before:border-green-800 from-green-50 dark:from-gray-900 to-white dark:to-gray-950 border border-green-50 text-green-700 dark:text-gray-400"><p data-svelte-h="svelte-4fzn75">📚 <strong>성능 팁</strong>: 효율적인 데이터 처리 기법에 대한 자세한 내용은 <a href="https://huggingface.co/docs/datasets/about_arrow" rel="nofollow">🤗 Datasets 성능 가이드</a>에서 배울 수 있습니다.</p></div> <p data-svelte-h="svelte-1qq8yfu">다음은 모든 데이터 세트에 토큰화 함수를 한 번에 적용하는 방법입니다. <code>map</code> 호출에서 <code>batched=True</code>를 사용하므로 함수가 데이터 세트의 각 요소에 개별적으로가 아니라 여러 요소에 한 번에 적용됩니다. 이를 통해 더 빠른 전처리가 가능합니다.</p> <div class="code-block relative "><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START -->tokenized_datasets = raw_datasets.<span class="hljs-built_in">map</span>(tokenize_function, batched=<span class="hljs-literal">True</span>) | |
| tokenized_datasets<!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-1uiluni">🤗 Datasets 라이브러리가 이 처리를 적용하는 방식은 전처리 함수가 반환하는 딕셔너리의 각 키에 대해 데이터 세트에 새 필드를 추가하는 것입니다.</p> <div class="code-block relative "><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START -->DatasetDict({ | |
| train: Dataset({ | |
| features: [<span class="hljs-string">'attention_mask'</span>, <span class="hljs-string">'idx'</span>, <span class="hljs-string">'input_ids'</span>, <span class="hljs-string">'label'</span>, <span class="hljs-string">'sentence1'</span>, <span class="hljs-string">'sentence2'</span>, <span class="hljs-string">'token_type_ids'</span>], | |
| num_rows: <span class="hljs-number">3668</span> | |
| }) | |
| validation: Dataset({ | |
| features: [<span class="hljs-string">'attention_mask'</span>, <span class="hljs-string">'idx'</span>, <span class="hljs-string">'input_ids'</span>, <span class="hljs-string">'label'</span>, <span class="hljs-string">'sentence1'</span>, <span class="hljs-string">'sentence2'</span>, <span class="hljs-string">'token_type_ids'</span>], | |
| num_rows: <span class="hljs-number">408</span> | |
| }) | |
| test: Dataset({ | |
| features: [<span class="hljs-string">'attention_mask'</span>, <span class="hljs-string">'idx'</span>, <span class="hljs-string">'input_ids'</span>, <span class="hljs-string">'label'</span>, <span class="hljs-string">'sentence1'</span>, <span class="hljs-string">'sentence2'</span>, <span class="hljs-string">'token_type_ids'</span>], | |
| num_rows: <span class="hljs-number">1725</span> | |
| }) | |
| })<!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-1irxq8x"><code>num_proc</code> 인수를 전달하여 <code>map()</code>으로 전처리 함수를 적용할 때 멀티프로세싱을 사용할 수도 있습니다. 🤗 Tokenizers 라이브러리가 이미 여러 스레드를 사용하여 샘플을 더 빠르게 토큰화하므로 여기서는 이를 사용하지 않았지만, 이 라이브러리가 뒷받침하는 빠른 토크나이저를 사용하지 않는다면 전처리 속도를 높일 수 있습니다.</p> <p data-svelte-h="svelte-15wpq8c">우리의 <code>tokenize_function</code>은 <code>input_ids</code>, <code>attention_mask</code>, <code>token_type_ids</code> 키가 있는 딕셔너리를 반환하므로, 이 세 필드가 데이터 세트의 모든 분할에 추가됩니다. 전처리 함수가 <code>map()</code>을 적용한 데이터 세트의 기존 키에 대한 새 값을 반환한다면 기존 필드를 변경할 수도 있었을 것입니다.</p> <p data-svelte-h="svelte-1qi6gn3">마지막으로 해야 할 일은 요소들을 배치로 묶을 때 모든 예제를 가장 긴 요소의 길이로 패딩하는 것입니다 — 이 기법을 <em>동적 패딩</em>이라고 합니다.</p> <h5 class="relative group"><a id="dynamic-padding" class="header-link block pr-1.5 text-lg no-hover:hidden with-hover:absolute with-hover:p-1.5 with-hover:opacity-0 with-hover:group-hover:opacity-100 with-hover:right-full" href="#dynamic-padding"><span><svg class="" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" aria-hidden="true" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 256 256"><path d="M167.594 88.393a8.001 8.001 0 0 1 0 11.314l-67.882 67.882a8 8 0 1 1-11.314-11.315l67.882-67.881a8.003 8.003 0 0 1 11.314 0zm-28.287 84.86l-28.284 28.284a40 40 0 0 1-56.567-56.567l28.284-28.284a8 8 0 0 0-11.315-11.315l-28.284 28.284a56 56 0 0 0 79.196 79.197l28.285-28.285a8 8 0 1 0-11.315-11.314zM212.852 43.14a56.002 56.002 0 0 0-79.196 0l-28.284 28.284a8 8 0 1 0 11.314 11.314l28.284-28.284a40 40 0 0 1 56.568 56.567l-28.285 28.285a8 8 0 0 0 11.315 11.314l28.284-28.284a56.065 56.065 0 0 0 0-79.196z" fill="currentColor"></path></svg></span></a> <span>동적 패딩</span></h5> <iframe class="w-full xl:w-4/6 h-80" src="https://www.youtube-nocookie.com/embed/7q5NyFT8REg" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe> <p data-svelte-h="svelte-1yf9txj">배치 내에서 샘플들을 함께 배치하는 역할을 하는 함수를 <em>collate function</em>이라고 합니다. 이는 <code>DataLoader</code>를 구축할 때 전달할 수 있는 인수로, 기본값은 샘플을 PyTorch 텐서로 변환하고 연결하는 함수입니다(요소가 목록, 튜플 또는 딕셔너리인 경우 재귀적으로). 우리의 경우 입력이 모두 같은 크기가 아니므로 이것은 불가능할 것입니다. 우리는 의도적으로 패딩을 연기하여 각 배치에서만 필요에 따라 적용하고 많은 패딩이 있는 지나치게 긴 입력을 피했습니다. 이것은 훈련을 상당히 가속화할 것이지만, TPU에서 훈련하는 경우 문제를 일으킬 수 있다는 점에 주의하세요 — TPU는 추가 패딩이 필요하더라도 고정된 모양을 선호합니다.</p> <div class="course-tip bg-gradient-to-br dark:bg-gradient-to-r before:border-green-500 dark:before:border-green-800 from-green-50 dark:from-gray-900 to-white dark:to-gray-950 border border-green-50 text-green-700 dark:text-gray-400"><p data-svelte-h="svelte-15yxb2i">🚀 <strong>최적화 가이드</strong>: 패딩 전략과 TPU 고려사항을 포함한 훈련 성능 최적화에 대한 자세한 내용은 <a href="https://huggingface.co/docs/transformers/main/en/performance" rel="nofollow">🤗 Transformers 성능 문서</a>를 참조하세요.</p></div> <p data-svelte-h="svelte-ae4z87">실제로 이를 수행하려면 함께 배치하려는 데이터 세트 항목에 적절한 양의 패딩을 적용할 collate function을 정의해야 합니다. 다행히 🤗 Transformers 라이브러리는 <code>DataCollatorWithPadding</code>을 통해 이러한 함수를 제공합니다. 인스턴스화할 때 토크나이저를 받아서(어떤 패딩 토큰을 사용할지, 모델이 입력의 왼쪽 또는 오른쪽에 패딩을 기대하는지 알기 위해) 필요한 모든 것을 수행합니다.</p> <div class="code-block relative "><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START --><span class="hljs-keyword">from</span> transformers <span class="hljs-keyword">import</span> DataCollatorWithPadding | |
| data_collator = DataCollatorWithPadding(tokenizer=tokenizer)<!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-1r7taxf">이 새로운 도구를 테스트하기 위해, 함께 배치하고 싶은 훈련 세트에서 몇 개의 샘플을 가져와보겠습니다. 여기서는 <code>idx</code>, <code>sentence1</code>, <code>sentence2</code> 열을 제거합니다. 이들은 필요하지 않고 문자열을 포함하고 있으며(문자열로는 텐서를 만들 수 없음), 배치의 각 항목 길이를 살펴보겠습니다.</p> <div class="code-block relative "><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START -->samples = tokenized_datasets[<span class="hljs-string">"train"</span>][:<span class="hljs-number">8</span>] | |
| samples = {k: v <span class="hljs-keyword">for</span> k, v <span class="hljs-keyword">in</span> samples.items() <span class="hljs-keyword">if</span> k <span class="hljs-keyword">not</span> <span class="hljs-keyword">in</span> [<span class="hljs-string">"idx"</span>, <span class="hljs-string">"sentence1"</span>, <span class="hljs-string">"sentence2"</span>]} | |
| [<span class="hljs-built_in">len</span>(x) <span class="hljs-keyword">for</span> x <span class="hljs-keyword">in</span> samples[<span class="hljs-string">"input_ids"</span>]]<!-- HTML_TAG_END --></pre></div> <div class="code-block relative "><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START -->[<span class="hljs-number">50</span>, <span class="hljs-number">59</span>, <span class="hljs-number">47</span>, <span class="hljs-number">67</span>, <span class="hljs-number">59</span>, <span class="hljs-number">50</span>, <span class="hljs-number">62</span>, <span class="hljs-number">32</span>]<!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-rpp9y6">당연히 32부터 67까지 다양한 길이의 샘플을 얻습니다. 동적 패딩은 이 배치의 샘플들이 모두 배치 내 최대 길이인 67로 패딩되어야 함을 의미합니다. 동적 패딩이 없다면, 모든 샘플이 전체 데이터 세트의 최대 길이나 모델이 받을 수 있는 최대 길이로 패딩되어야 할 것입니다. <code>data_collator</code>가 배치를 동적으로 올바르게 패딩하는지 다시 확인해보겠습니다.</p> <div class="code-block relative "><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START -->batch = data_collator(samples) | |
| {k: v.shape <span class="hljs-keyword">for</span> k, v <span class="hljs-keyword">in</span> batch.items()}<!-- HTML_TAG_END --></pre></div> <div class="code-block relative "><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START -->{<span class="hljs-string">'attention_mask'</span>: torch.Size([<span class="hljs-number">8</span>, <span class="hljs-number">67</span>]), | |
| <span class="hljs-string">'input_ids'</span>: torch.Size([<span class="hljs-number">8</span>, <span class="hljs-number">67</span>]), | |
| <span class="hljs-string">'token_type_ids'</span>: torch.Size([<span class="hljs-number">8</span>, <span class="hljs-number">67</span>]), | |
| <span class="hljs-string">'labels'</span>: torch.Size([<span class="hljs-number">8</span>])}<!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-1ccey19">좋아 보입니다! 이제 원시 텍스트에서 모델이 처리할 수 있는 배치까지 만들었으므로, 미세 조정할 준비가 되었습니다!</p> <div class="course-tip bg-gradient-to-br dark:bg-gradient-to-r before:border-green-500 dark:before:border-green-800 from-green-50 dark:from-gray-900 to-white dark:to-gray-950 border border-green-50 text-green-700 dark:text-gray-400"><p data-svelte-h="svelte-1qahcvz">✏️ <strong>직접 해보기!</strong> GLUE SST-2 데이터 세트에서 전처리를 복제해보세요. 쌍이 아닌 단일 문장으로 구성되어 있어 약간 다르지만, 나머지는 동일하게 보일 것입니다. 더 어려운 도전을 위해서는 GLUE 작업 중 어떤 것에서도 작동하는 전처리 함수를 작성해보세요.</p> <p data-svelte-h="svelte-p3aocu">📖 <strong>추가 연습</strong>: <a href="https://huggingface.co/docs/transformers/main/en/notebooks" rel="nofollow">🤗 Transformers 예제</a>에서 이러한 실습 예제들을 확인해보세요.</p></div> <p data-svelte-h="svelte-furr9b">완벽합니다! 이제 🤗 Datasets 라이브러리의 최신 모범 사례로 데이터를 전처리했으므로, 최신 Trainer API를 사용하여 모델을 훈련할 준비가 되었습니다. 다음 섹션에서는 Hugging Face 생태계에서 사용할 수 있는 최신 기능과 최적화를 사용하여 모델을 효과적으로 미세 조정하는 방법을 보여드리겠습니다.</p> <h2 class="relative group"><a id="section-quiz" class="header-link block pr-1.5 text-lg no-hover:hidden with-hover:absolute with-hover:p-1.5 with-hover:opacity-0 with-hover:group-hover:opacity-100 with-hover:right-full" href="#section-quiz"><span><svg class="" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" aria-hidden="true" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 256 256"><path d="M167.594 88.393a8.001 8.001 0 0 1 0 11.314l-67.882 67.882a8 8 0 1 1-11.314-11.315l67.882-67.881a8.003 8.003 0 0 1 11.314 0zm-28.287 84.86l-28.284 28.284a40 40 0 0 1-56.567-56.567l28.284-28.284a8 8 0 0 0-11.315-11.315l-28.284 28.284a56 56 0 0 0 79.196 79.197l28.285-28.285a8 8 0 1 0-11.315-11.314zM212.852 43.14a56.002 56.002 0 0 0-79.196 0l-28.284 28.284a8 8 0 1 0 11.314 11.314l28.284-28.284a40 40 0 0 1 56.568 56.567l-28.285 28.285a8 8 0 0 0 11.315 11.314l28.284-28.284a56.065 56.065 0 0 0 0-79.196z" fill="currentColor"></path></svg></span></a> <span>섹션 퀴즈</span></h2> <p data-svelte-h="svelte-1jo93me">데이터 처리 개념에 대한 이해도를 테스트해보세요.</p> <h3 class="relative group"><a id="1-batchedtrue-와-함께-datasetmap-을-사용하는-주요-장점은-무엇인가요" class="header-link block pr-1.5 text-lg no-hover:hidden with-hover:absolute with-hover:p-1.5 with-hover:opacity-0 with-hover:group-hover:opacity-100 with-hover:right-full" href="#1-batchedtrue-와-함께-datasetmap-을-사용하는-주요-장점은-무엇인가요"><span><svg class="" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" aria-hidden="true" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 256 256"><path d="M167.594 88.393a8.001 8.001 0 0 1 0 11.314l-67.882 67.882a8 8 0 1 1-11.314-11.315l67.882-67.881a8.003 8.003 0 0 1 11.314 0zm-28.287 84.86l-28.284 28.284a40 40 0 0 1-56.567-56.567l28.284-28.284a8 8 0 0 0-11.315-11.315l-28.284 28.284a56 56 0 0 0 79.196 79.197l28.285-28.285a8 8 0 1 0-11.315-11.314zM212.852 43.14a56.002 56.002 0 0 0-79.196 0l-28.284 28.284a8 8 0 1 0 11.314 11.314l28.284-28.284a40 40 0 0 1 56.568 56.567l-28.285 28.285a8 8 0 0 0 11.315 11.314l28.284-28.284a56.065 56.065 0 0 0 0-79.196z" fill="currentColor"></path></svg></span></a> <span>1. batched=True 와 함께 Dataset.map() 을 사용하는 주요 장점은 무엇인가요?</span></h3> <div><form><label class="block"><input autocomplete="off" class="form-input -mt-1.5 mr-2" name="choice" type="checkbox" value="0"> <!-- HTML_TAG_START -->메모리를 덜 사용합니다.<!-- HTML_TAG_END --></label> <label class="block"><input autocomplete="off" class="form-input -mt-1.5 mr-2" name="choice" type="checkbox" value="1"> <!-- HTML_TAG_START -->여러 예제를 한 번에 처리하여 토큰화를 훨씬 빠르게 만듭니다.<!-- HTML_TAG_END --></label> <label class="block"><input autocomplete="off" class="form-input -mt-1.5 mr-2" name="choice" type="checkbox" value="2"> <!-- HTML_TAG_START -->자동으로 패딩을 처리해줍니다.<!-- HTML_TAG_END --></label> <label class="block"><input autocomplete="off" class="form-input -mt-1.5 mr-2" name="choice" type="checkbox" value="3"> <!-- HTML_TAG_START -->데이터를 PyTorch 텐서로 변환합니다.<!-- HTML_TAG_END --></label> <div class="flex flex-row items-center mt-3"><button class="btn px-4 mr-4" type="submit" disabled>Submit</button> </div></form></div> <h3 class="relative group"><a id="2-데이터-세트의-최대-길이로-모든-시퀀스를-패딩하는-대신-동적-패딩을-사용하는-이유는-무엇인가요" class="header-link block pr-1.5 text-lg no-hover:hidden with-hover:absolute with-hover:p-1.5 with-hover:opacity-0 with-hover:group-hover:opacity-100 with-hover:right-full" href="#2-데이터-세트의-최대-길이로-모든-시퀀스를-패딩하는-대신-동적-패딩을-사용하는-이유는-무엇인가요"><span><svg class="" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" aria-hidden="true" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 256 256"><path d="M167.594 88.393a8.001 8.001 0 0 1 0 11.314l-67.882 67.882a8 8 0 1 1-11.314-11.315l67.882-67.881a8.003 8.003 0 0 1 11.314 0zm-28.287 84.86l-28.284 28.284a40 40 0 0 1-56.567-56.567l28.284-28.284a8 8 0 0 0-11.315-11.315l-28.284 28.284a56 56 0 0 0 79.196 79.197l28.285-28.285a8 8 0 1 0-11.315-11.314zM212.852 43.14a56.002 56.002 0 0 0-79.196 0l-28.284 28.284a8 8 0 1 0 11.314 11.314l28.284-28.284a40 40 0 0 1 56.568 56.567l-28.285 28.285a8 8 0 0 0 11.315 11.314l28.284-28.284a56.065 56.065 0 0 0 0-79.196z" fill="currentColor"></path></svg></span></a> <span>2. 데이터 세트의 최대 길이로 모든 시퀀스를 패딩하는 대신 동적 패딩을 사용하는 이유는 무엇인가요?</span></h3> <div><form><label class="block"><input autocomplete="off" class="form-input -mt-1.5 mr-2" name="choice" type="checkbox" value="0"> <!-- HTML_TAG_START -->동적 패딩이 모델 아키텍처에 의해 요구됩니다.<!-- HTML_TAG_END --></label> <label class="block"><input autocomplete="off" class="form-input -mt-1.5 mr-2" name="choice" type="checkbox" value="1"> <!-- HTML_TAG_START -->각 배치의 최대 길이까지만 패딩하여 계산 오버헤드를 줄입니다.<!-- HTML_TAG_END --></label> <label class="block"><input autocomplete="off" class="form-input -mt-1.5 mr-2" name="choice" type="checkbox" value="2"> <!-- HTML_TAG_START -->모델 정확도를 향상시킵니다.<!-- HTML_TAG_END --></label> <label class="block"><input autocomplete="off" class="form-input -mt-1.5 mr-2" name="choice" type="checkbox" value="3"> <!-- HTML_TAG_START -->DataCollatorWithPadding을 사용할 때 필수입니다.<!-- HTML_TAG_END --></label> <div class="flex flex-row items-center mt-3"><button class="btn px-4 mr-4" type="submit" disabled>Submit</button> </div></form></div> <h3 class="relative group"><a id="3-bert-토큰화에서-tokentypeids-필드는-무엇을-나타내나요" class="header-link block pr-1.5 text-lg no-hover:hidden with-hover:absolute with-hover:p-1.5 with-hover:opacity-0 with-hover:group-hover:opacity-100 with-hover:right-full" href="#3-bert-토큰화에서-tokentypeids-필드는-무엇을-나타내나요"><span><svg class="" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" aria-hidden="true" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 256 256"><path d="M167.594 88.393a8.001 8.001 0 0 1 0 11.314l-67.882 67.882a8 8 0 1 1-11.314-11.315l67.882-67.881a8.003 8.003 0 0 1 11.314 0zm-28.287 84.86l-28.284 28.284a40 40 0 0 1-56.567-56.567l28.284-28.284a8 8 0 0 0-11.315-11.315l-28.284 28.284a56 56 0 0 0 79.196 79.197l28.285-28.285a8 8 0 1 0-11.315-11.314zM212.852 43.14a56.002 56.002 0 0 0-79.196 0l-28.284 28.284a8 8 0 1 0 11.314 11.314l28.284-28.284a40 40 0 0 1 56.568 56.567l-28.285 28.285a8 8 0 0 0 11.315 11.314l28.284-28.284a56.065 56.065 0 0 0 0-79.196z" fill="currentColor"></path></svg></span></a> <span>3. BERT 토큰화에서 token_type_ids 필드는 무엇을 나타내나요?</span></h3> <div><form><label class="block"><input autocomplete="off" class="form-input -mt-1.5 mr-2" name="choice" type="checkbox" value="0"> <!-- HTML_TAG_START -->시퀀스에서 각 토큰의 위치입니다.<!-- HTML_TAG_END --></label> <label class="block"><input autocomplete="off" class="form-input -mt-1.5 mr-2" name="choice" type="checkbox" value="1"> <!-- HTML_TAG_START -->문장 쌍을 처리할 때 각 토큰이 어느 문장에 속하는지를 나타냅니다.<!-- HTML_TAG_END --></label> <label class="block"><input autocomplete="off" class="form-input -mt-1.5 mr-2" name="choice" type="checkbox" value="2"> <!-- HTML_TAG_START -->각 토큰의 어텐션 마스크입니다.<!-- HTML_TAG_END --></label> <label class="block"><input autocomplete="off" class="form-input -mt-1.5 mr-2" name="choice" type="checkbox" value="3"> <!-- HTML_TAG_START -->각 토큰의 어휘 ID입니다.<!-- HTML_TAG_END --></label> <div class="flex flex-row items-center mt-3"><button class="btn px-4 mr-4" type="submit" disabled>Submit</button> </div></form></div> <h3 class="relative group"><a id="4-loaddatasetglue-mrpc-로-데이터-세트를-로딩할-때-두-번째-인수는-무엇을-지정하나요" class="header-link block pr-1.5 text-lg no-hover:hidden with-hover:absolute with-hover:p-1.5 with-hover:opacity-0 with-hover:group-hover:opacity-100 with-hover:right-full" href="#4-loaddatasetglue-mrpc-로-데이터-세트를-로딩할-때-두-번째-인수는-무엇을-지정하나요"><span><svg class="" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" aria-hidden="true" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 256 256"><path d="M167.594 88.393a8.001 8.001 0 0 1 0 11.314l-67.882 67.882a8 8 0 1 1-11.314-11.315l67.882-67.881a8.003 8.003 0 0 1 11.314 0zm-28.287 84.86l-28.284 28.284a40 40 0 0 1-56.567-56.567l28.284-28.284a8 8 0 0 0-11.315-11.315l-28.284 28.284a56 56 0 0 0 79.196 79.197l28.285-28.285a8 8 0 1 0-11.315-11.314zM212.852 43.14a56.002 56.002 0 0 0-79.196 0l-28.284 28.284a8 8 0 1 0 11.314 11.314l28.284-28.284a40 40 0 0 1 56.568 56.567l-28.285 28.285a8 8 0 0 0 11.315 11.314l28.284-28.284a56.065 56.065 0 0 0 0-79.196z" fill="currentColor"></path></svg></span></a> <span>4. load_dataset('glue', 'mrpc') 로 데이터 세트를 로딩할 때 두 번째 인수는 무엇을 지정하나요?</span></h3> <div><form><label class="block"><input autocomplete="off" class="form-input -mt-1.5 mr-2" name="choice" type="checkbox" value="0"> <!-- HTML_TAG_START -->로딩할 데이터 세트의 버전입니다.<!-- HTML_TAG_END --></label> <label class="block"><input autocomplete="off" class="form-input -mt-1.5 mr-2" name="choice" type="checkbox" value="1"> <!-- HTML_TAG_START -->GLUE 벤치마크 내의 특정 작업 또는 하위 집합입니다.<!-- HTML_TAG_END --></label> <label class="block"><input autocomplete="off" class="form-input -mt-1.5 mr-2" name="choice" type="checkbox" value="2"> <!-- HTML_TAG_START -->데이터 세트의 분할(train/validation/test)입니다.<!-- HTML_TAG_END --></label> <label class="block"><input autocomplete="off" class="form-input -mt-1.5 mr-2" name="choice" type="checkbox" value="3"> <!-- HTML_TAG_START -->데이터를 반환할 형식입니다.<!-- HTML_TAG_END --></label> <div class="flex flex-row items-center mt-3"><button class="btn px-4 mr-4" type="submit" disabled>Submit</button> </div></form></div> <h3 class="relative group"><a id="5-훈련-전에-sentence1과-sentence2-같은-열을-제거하는-목적은-무엇인가요" class="header-link block pr-1.5 text-lg no-hover:hidden with-hover:absolute with-hover:p-1.5 with-hover:opacity-0 with-hover:group-hover:opacity-100 with-hover:right-full" href="#5-훈련-전에-sentence1과-sentence2-같은-열을-제거하는-목적은-무엇인가요"><span><svg class="" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" aria-hidden="true" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 256 256"><path d="M167.594 88.393a8.001 8.001 0 0 1 0 11.314l-67.882 67.882a8 8 0 1 1-11.314-11.315l67.882-67.881a8.003 8.003 0 0 1 11.314 0zm-28.287 84.86l-28.284 28.284a40 40 0 0 1-56.567-56.567l28.284-28.284a8 8 0 0 0-11.315-11.315l-28.284 28.284a56 56 0 0 0 79.196 79.197l28.285-28.285a8 8 0 1 0-11.315-11.314zM212.852 43.14a56.002 56.002 0 0 0-79.196 0l-28.284 28.284a8 8 0 1 0 11.314 11.314l28.284-28.284a40 40 0 0 1 56.568 56.567l-28.285 28.285a8 8 0 0 0 11.315 11.314l28.284-28.284a56.065 56.065 0 0 0 0-79.196z" fill="currentColor"></path></svg></span></a> <span>5. 훈련 전에 ‘sentence1’과 ‘sentence2’ 같은 열을 제거하는 목적은 무엇인가요?</span></h3> <div><form><label class="block"><input autocomplete="off" class="form-input -mt-1.5 mr-2" name="choice" type="checkbox" value="0"> <!-- HTML_TAG_START -->훈련 중 메모리를 절약하기 위해서입니다.<!-- HTML_TAG_END --></label> <label class="block"><input autocomplete="off" class="form-input -mt-1.5 mr-2" name="choice" type="checkbox" value="1"> <!-- HTML_TAG_START -->모델이 이러한 원시 텍스트 열을 예상하지 않고 오류가 발생할 수 있기 때문입니다.<!-- HTML_TAG_END --></label> <label class="block"><input autocomplete="off" class="form-input -mt-1.5 mr-2" name="choice" type="checkbox" value="2"> <!-- HTML_TAG_START -->이러한 열들이 평가에 필요하지 않기 때문입니다.<!-- HTML_TAG_END --></label> <label class="block"><input autocomplete="off" class="form-input -mt-1.5 mr-2" name="choice" type="checkbox" value="3"> <!-- HTML_TAG_START -->훈련 속도를 크게 향상시키기 때문입니다.<!-- HTML_TAG_END --></label> <div class="flex flex-row items-center mt-3"><button class="btn px-4 mr-4" type="submit" disabled>Submit</button> </div></form></div> <div class="course-tip bg-gradient-to-br dark:bg-gradient-to-r before:border-green-500 dark:before:border-green-800 from-green-50 dark:from-gray-900 to-white dark:to-gray-950 border border-green-50 text-green-700 dark:text-gray-400"><p data-svelte-h="svelte-17xh5q0">💡 <strong>핵심 요점</strong></p> <ul data-svelte-h="svelte-mib1et"><li>전처리를 훨씬 빠르게 하려면 <code>Dataset.map()</code>에서 <code>batched=True</code>를 사용하세요</li> <li><code>DataCollatorWithPadding</code>을 사용한 동적 패딩이 고정 길이 패딩보다 효율적입니다</li> <li>모델의 추론 결과물(수치적 텐서, 올바른 열 이름)에 맞게 항상 데이터를 전처리하세요</li> <li>🤗 Datasets 라이브러리는 대규모 효율적인 데이터 처리를 위한 강력한 도구를 제공합니다</li></ul></div> <a class="!text-gray-400 !no-underline text-sm flex items-center not-prose mt-4" href="https://github.com/huggingface/course/blob/main/chapters/ko/chapter3/2.mdx" target="_blank"><span data-svelte-h="svelte-1kd6by1"><</span> <span data-svelte-h="svelte-x0xyl0">></span> <span data-svelte-h="svelte-1dajgef"><span class="underline ml-1.5">Update</span> on GitHub</span></a> <p></p> | |
| <script> | |
| { | |
| __sveltekit_5pcgyl = { | |
| assets: "/docs/course/pr_1069/ko", | |
| base: "/docs/course/pr_1069/ko", | |
| env: {} | |
| }; | |
| const element = document.currentScript.parentElement; | |
| const data = [null,null]; | |
| Promise.all([ | |
| import("/docs/course/pr_1069/ko/_app/immutable/entry/start.667ba29b.js"), | |
| import("/docs/course/pr_1069/ko/_app/immutable/entry/app.65e35200.js") | |
| ]).then(([kit, app]) => { | |
| kit.start(app, element, { | |
| node_ids: [0, 21], | |
| data, | |
| form: null, | |
| error: null | |
| }); | |
| }); | |
| } | |
| </script> | |
Xet Storage Details
- Size:
- 89 kB
- Xet hash:
- 3e7affa478162861d971f1a8f0269e88443f36b7a0b1d25bf138c6594372fa4a
·
Xet efficiently stores files, intelligently splitting them into unique chunks and accelerating uploads and downloads. More info.