Handle Files
Besides raw text data, you may wish to extract information from other file types such as PowerPoint presentations or PDFs.
The general strategy is to use a LangChain document loader or other method to parse files into a text format that can be fed into LLMs.
LangChain features a large number of document loader integrations.
Letโs go over an example of loading and extracting data from a PDF. First, we install required dependencies:
- npm
- yarn
- pnpm
npm i @langchain/openai zod
yarn add @langchain/openai zod
pnpm add @langchain/openai zod
import { PDFLoader } from "langchain/document_loaders/fs/pdf";
// Only required in a Deno notebook environment to load the peer dep.
import "pdf-parse";
const loader = new PDFLoader("./test/data/bitcoin.pdf");
const docs = await loader.load();
[Module: null prototype] { default: [AsyncFunction: PDF] }
Now that weโve loaded a PDF document, letโs try extracting mentioned people. We can define a schema like this:
import { z } from "zod";
const personSchema = z
.object({
name: z.optional(z.string()).describe("The name of the person"),
hair_color: z
.optional(z.string())
.describe("The color of the person's hair, if known"),
height_in_meters: z
.optional(z.string())
.describe("Height measured in meters"),
email: z.optional(z.string()).describe("The person's email, if present"),
})
.describe("Information about a person.");
const peopleSchema = z.object({
people: z.array(personSchema),
});
And then initialize our extraction chain like this:
import { ChatPromptTemplate } from "@langchain/core/prompts";
import { ChatOpenAI } from "@langchain/openai";
const SYSTEM_PROMPT_TEMPLATE = `You are an expert extraction algorithm.
Only extract relevant information from the text.
If you do not know the value of an attribute asked to extract, you may omit the attribute's value.`;
const prompt = ChatPromptTemplate.fromMessages([
["system", SYSTEM_PROMPT_TEMPLATE],
["human", "{text}"],
]);
const llm = new ChatOpenAI({
model: "gpt-4-0125-preview",
temperature: 0,
});
const extractionRunnable = prompt.pipe(
llm.withStructuredOutput(peopleSchema, { name: "people" })
);
Now, letโs try invoking it!
await extractionRunnable.invoke({ text: docs[0].pageContent });
{ people: [ { name: "Satoshi Nakamoto", email: "satoshin@gmx.com" } ] }