ingest fails with Cannot read properties of undefined (reading 'text') #472

Closed
dmotxilla opened this issue Jul 29, 2024 · 5 comments
Labels
kind: bug Related to a bug, vulnerability, unexpected error with an existing feature

Comments

@dmotxilla

The README says:

Visit [pinecone](https://pinecone.io/) to create and retrieve your API keys, and also retrieve your environment and index name from the dashboard.

but I do not see an "environment" variable.
I created a Pinecone account, got the API key (and I use it for OPENAI_API_KEY), and created an index named "index1" (which I use for PINECONE_INDEX_NAME) with default parameters. I see a "Region" parameter with the value "us-east-1". Is that what I need to use for PINECONE_ENVIRONMENT?

image

but then, the ingest fails with Cannot read properties of undefined (reading 'text').

$ yarn run ingest

creating vector store...
error TypeError: Cannot read properties of undefined (reading 'text')
    at /home/david/test1/gpt4-pdf-chatbot-langchain/node_modules/@pinecone-database/pinecone/dist/errors/utils.js:44:57
    at step (/home/david/test1/gpt4-pdf-chatbot-langchain/node_modules/@pinecone-database/pinecone/dist/errors/utils.js:33:23)
    at Object.next (/home/david/test1/gpt4-pdf-chatbot-langchain/node_modules/@pinecone-database/pinecone/dist/errors/utils.js:14:53)
    at /home/david/test1/gpt4-pdf-chatbot-langchain/node_modules/@pinecone-database/pinecone/dist/errors/utils.js:8:71
    at new Promise (<anonymous>)
    at __awaiter (/home/david/test1/gpt4-pdf-chatbot-langchain/node_modules/@pinecone-database/pinecone/dist/errors/utils.js:4:12)
    at extractMessage (/home/david/test1/gpt4-pdf-chatbot-langchain/node_modules/@pinecone-database/pinecone/dist/errors/utils.js:40:48)
    at /home/david/test1/gpt4-pdf-chatbot-langchain/node_modules/@pinecone-database/pinecone/dist/errors/handling.js:66:70
    at step (/home/david/test1/gpt4-pdf-chatbot-langchain/node_modules/@pinecone-database/pinecone/dist/errors/handling.js:33:23)
    at Object.next (/home/david/test1/gpt4-pdf-chatbot-langchain/node_modules/@pinecone-database/pinecone/dist/errors/handling.js:14:53)

file:///home/david/test1/gpt4-pdf-chatbot-langchain/scripts/ingest-data.ts:46
    throw new Error('Failed to ingest your data');
          ^
Error: Failed to ingest your data
    at run (file:///home/david/test1/gpt4-pdf-chatbot-langchain/scripts/ingest-data.ts:46:11)
    at processTicksAndRejections (node:internal/process/task_queues:95:5)
    at file:///home/david/test1/gpt4-pdf-chatbot-langchain/scripts/ingest-data.ts:51:3
@dosubot dosubot bot added the kind: bug Related to a bug, vulnerability, unexpected error with an existing feature label Jul 29, 2024

dosubot bot commented Jul 29, 2024

Hey @dmotxilla! 👋 I'm here to help you with any bugs, questions, or contributions you have. Let's squash those bugs together!

The error Cannot read properties of undefined (reading 'text') is likely due to an incorrect PINECONE_ENVIRONMENT value. The "Region" parameter (us-east-1) you mentioned is not the correct format for PINECONE_ENVIRONMENT. Instead, it should be something like us-east1-gcp.

Here is the corrected .env configuration:

OPENAI_API_KEY=

# Update these with your pinecone details from your dashboard. 
# PINECONE_INDEX_NAME is in the indexes tab under "index name" in blue
# PINECONE_ENVIRONMENT is in indexes tab under "Environment". Example: "us-east1-gcp"
PINECONE_API_KEY=  
PINECONE_ENVIRONMENT=us-east1-gcp
PINECONE_INDEX_NAME=index1

Ensure that all the required environment variables (OPENAI_API_KEY, PINECONE_API_KEY, PINECONE_ENVIRONMENT, and PINECONE_INDEX_NAME) are correctly set and not left empty. Missing or incorrect values can lead to undefined objects, causing the error you are encountering [1].
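
A quick way to rule out empty or missing variables is to check them before the ingest starts. The following is only a minimal sketch: `findMissingVars` is a hypothetical helper that does not exist in the repository, and the variable names are the four listed above.

```typescript
// Names of the variables listed in the .env file above.
const REQUIRED_VARS = [
  'OPENAI_API_KEY',
  'PINECONE_API_KEY',
  'PINECONE_ENVIRONMENT',
  'PINECONE_INDEX_NAME',
] as const;

type Env = Record<string, string | undefined>;

// Hypothetical helper (not part of the repository): returns the names of
// required variables that are unset or contain only whitespace.
function findMissingVars(env: Env): string[] {
  return REQUIRED_VARS.filter((name) => {
    const value = env[name];
    return value === undefined || value.trim() === '';
  });
}

// Example with inline values; in the ingest script you would pass process.env.
const missing = findMissingVars({
  OPENAI_API_KEY: 'sk-example',
  PINECONE_API_KEY: '  ', // blank, so it is reported
  PINECONE_INDEX_NAME: 'index1',
});
console.log('missing variables:', missing);
```

Note that a check like this only catches empty values. It cannot tell whether a non-empty value such as the environment string is the right one for the account.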

Additionally, verify the structure of the documents being processed. Ensure that the documents loaded by directoryLoader.load() and the documents returned by textSplitter.splitDocuments(rawDocs) have the text property. You can add logging to inspect the structure of these documents:

import { RecursiveCharacterTextSplitter } from 'langchain/text_splitter';
import { OpenAIEmbeddings } from 'langchain/embeddings/openai';
import { PineconeStore } from 'langchain/vectorstores/pinecone';
import { pinecone } from '@/utils/pinecone-client';
import { PDFLoader } from 'langchain/document_loaders/fs/pdf';
import { PINECONE_INDEX_NAME, PINECONE_NAME_SPACE } from '@/config/pinecone';
import { DirectoryLoader } from 'langchain/document_loaders/fs/directory';

const filePath = 'docs';

export const run = async () => {
  try {
    const directoryLoader = new DirectoryLoader(filePath, {
      '.pdf': (path) => new PDFLoader(path),
    });

    const rawDocs = await directoryLoader.load();
    console.log('rawDocs', rawDocs); // Log rawDocs to inspect structure

    const textSplitter = new RecursiveCharacterTextSplitter({
      chunkSize: 1000,
      chunkOverlap: 200,
    });

    const docs = await textSplitter.splitDocuments(rawDocs);
    console.log('split docs', docs); // Log split docs to inspect structure

    console.log('creating vector store...');
    const embeddings = new OpenAIEmbeddings();
    const index = pinecone.Index(PINECONE_INDEX_NAME);

    await PineconeStore.fromDocuments(docs, embeddings, {
      pineconeIndex: index,
      namespace: PINECONE_NAME_SPACE,
      textKey: 'text',
    });
  } catch (error) {
    console.log('error', error);
    throw new Error('Failed to ingest your data');
  }
};

(async () => {
  await run();
  console.log('ingestion complete');
})();

By following these steps, you should be able to identify and resolve the issue [2].
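
Instead of reading the full log output, the structure check can be narrowed to one question: does every chunk carry usable text? The sketch below assumes the document shape that the log output later in this thread shows (`pageContent` plus `metadata`). `DocLike` and `findEmptyChunks` are illustrative names, not part of LangChain or the repository.

```typescript
// Minimal structural type for the documents printed in this thread's logs
// (an assumption for illustration; the real class comes from LangChain).
interface DocLike {
  pageContent: string;
  metadata: Record<string, unknown>;
}

// Hypothetical helper: returns the indexes of chunks whose pageContent is
// missing or blank, so they can be inspected before embedding.
function findEmptyChunks(docs: DocLike[]): number[] {
  const empty: number[] = [];
  docs.forEach((doc, i) => {
    if (typeof doc.pageContent !== 'string' || doc.pageContent.trim() === '') {
      empty.push(i);
    }
  });
  return empty;
}

const sample: DocLike[] = [
  { pageContent: '1     Introduction', metadata: { source: 'docs/sample.pdf' } },
  { pageContent: '   ', metadata: { source: 'docs/sample.pdf' } },
];
// The chunk at index 1 is blank, so it is the only one flagged.
console.log('empty chunks:', findEmptyChunks(sample));
```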

@dmotxilla
Author

Thanks for the answer.
The "Region" value "us-east-1" was indeed not correct.
I have now tried the value "us-east1-gcp", and it works better, but it still fails.
(By the way, on the Pinecone website I don't see this "us-east1-gcp" value anywhere, nor an "Environment" parameter. Maybe that is because I have the free account with the Default project? Where do I see the name of the environment? Can you please post a screenshot?)

Anyway, using the value "us-east1-gcp" as PINECONE_ENVIRONMENT works a bit better, but it still fails:

The documents returned by textSplitter.splitDocuments(rawDocs) do not have a 'text' property. They have a pageContent property instead. Is this OK? In any case, it no longer complains about the text property. It now complains: "Verify you have the correct environment, project id, and index name configured."
The index name is correct, I guess, as shown in the screenshot in my previous message. The API key is also correct, I think. The environment, I am not sure about. If I use "us-east-1" or something like "hahha", I get the text error. However, with the "us-east1-gcp" value, it fails with something different. Again, I checked the whole web console of my Pinecone account (free account), and I don't see an Environment parameter anywhere.

$ cat .env
OPENAI_API_KEY=sk-proj-QPLoQyqRqnNtzklWHrqFT3BlbkFJKm...
PINECONE_API_KEY=ae9ac4c6-fa87-49c2-bde0-21....
PINECONE_ENVIRONMENT=us-east1-gcp
PINECONE_INDEX_NAME=index1

$ yarn run ingest
yarn run v1.22.22
$ tsx -r dotenv/config scripts/ingest-data.ts
(node:2096485) [DEP0040] DeprecationWarning: The `punycode` module is deprecated. Please use a userland alternative instead.
(Use `node --trace-deprecation ...` to show where the warning was created)
rawDocs [
  Document {
    pageContent: '  \n' +
      '                                          Non-chemical                                          crop                                          protection                                           \n' +
      '6   \n' +
      '1     Introduction \n' +
      '1.1    What is this booklet about?  \n' +
      'As  a  farmer  of  arable  crops  or  vegetables,  you  strive  to  achieve  the   \n' +
      'highest  yield  and  the  best  quality  product  possible.  Of  course  you   \n' +
      'would prefer to do this with a minimum investment of energy and re- \n' +
      'sources,  but  you  are  continuously  bothered  by  all  kinds  of  harmful   \n' +
      'organisms (pests) that threaten to  reduce the quality and yield of your  \n' +
      'crops.  Protecting  your  crops  from  these  pests  is  extremely  important,   \n' +
      'but  it  is  difficult  to  achieve  maximum  results  with  a  minimum  of  ef- \n' +
      'fort.  You  have  to  look  not  only  at  a  measure’s  immediate  effect,  but   \n' +
      'also at its long-term effect.   \n' +
      'This  booklet  gives  an  overview  of  the  main  non-chemical  measures   \n' +
      'you can take to protect your crops from pests. Most of these measures  \n' +
      'are  preventive:  they  involve  planning  and  farming  practices  that  will   \n' +
      'help you to keep pest numbers down and limit the damage they do.  \n' +
      '1.2    Why publish a booklet about non-chemical  \n' +
      'crop protection?  \n' +
      'Non-chemical methods of crop protection have always been practised,  \n' +
      'but the introduction of chemical pesticides a few decades ago seemed  \n' +
      'to  make  crop  protection  a  lot  easier.  As  a  farmer,  you  suddenly  only   \n' +
      'needed  to  know  which  particular  pests  you  were  dealing  with,  which   \n' +
      'pesticides  were  available  to  control  them,  and  how  to  apply  these   \n' +
      'products safely. These products were so effective that it looked at first  \n' +
      'as  though  all  pests  could  be  eradicated  in  this  way.  In  practice,  how- \n' +
      'ever,  the  pests  were  not  actually  eradicated,  because  they  came  back   \n' +
      'every growing season. Many natural enemies were temporarily wiped  \n' +
      'out along with the pests, which gave the pests the opportunity to mul- \n' +
      'tiply even more explosively than before (see case 1 for an example).  \n' +
      'To ensure a healthy crop, it was often necessary to spray several times  \n' +
      'per season just to control one type of pest. Eventually, some pesticides ',
    metadata: {
      source: '/home/david/test1/gpt4-pdf-chatbot-langchain/docs/sample.pdf',
      pdf: [Object],
      loc: [Object]
    }
  }
]
split docs [
  Document {
    pageContent: 'Non-chemical                                          crop                                          protection                                           \n' +
      '6   \n' +
      '1     Introduction \n' +
      '1.1    What is this booklet about?  \n' +
      'As  a  farmer  of  arable  crops  or  vegetables,  you  strive  to  achieve  the   \n' +
      'highest  yield  and  the  best  quality  product  possible.  Of  course  you   \n' +
      'would prefer to do this with a minimum investment of energy and re- \n' +
      'sources,  but  you  are  continuously  bothered  by  all  kinds  of  harmful   \n' +
      'organisms (pests) that threaten to  reduce the quality and yield of your  \n' +
      'crops.  Protecting  your  crops  from  these  pests  is  extremely  important,   \n' +
      'but  it  is  difficult  to  achieve  maximum  results  with  a  minimum  of  ef- \n' +
      'fort.  You  have  to  look  not  only  at  a  measure’s  immediate  effect,  but   \n' +
      'also at its long-term effect.',
    metadata: {
      source: '/home/david/test1/gpt4-pdf-chatbot-langchain/docs/sample.pdf',
      pdf: [Object],
      loc: [Object]
    }
  },
  Document {
    pageContent: 'but  it  is  difficult  to  achieve  maximum  results  with  a  minimum  of  ef- \n' +
      'fort.  You  have  to  look  not  only  at  a  measure’s  immediate  effect,  but   \n' +
      'also at its long-term effect.   \n' +
      'This  booklet  gives  an  overview  of  the  main  non-chemical  measures   \n' +
      'you can take to protect your crops from pests. Most of these measures  \n' +
      'are  preventive:  they  involve  planning  and  farming  practices  that  will   \n' +
      'help you to keep pest numbers down and limit the damage they do.  \n' +
      '1.2    Why publish a booklet about non-chemical  \n' +
      'crop protection?  \n' +
      'Non-chemical methods of crop protection have always been practised,  \n' +
      'but the introduction of chemical pesticides a few decades ago seemed  \n' +
      'to  make  crop  protection  a  lot  easier.  As  a  farmer,  you  suddenly  only   \n' +
      'needed  to  know  which  particular  pests  you  were  dealing  with,  which   \n' +
      'pesticides  were  available  to  control  them,  and  how  to  apply  these',
    metadata: {
      source: '/home/david/test1/gpt4-pdf-chatbot-langchain/docs/sample.pdf',
      pdf: [Object],
      loc: [Object]
    }
  },
  Document {
    pageContent: 'needed  to  know  which  particular  pests  you  were  dealing  with,  which   \n' +
      'pesticides  were  available  to  control  them,  and  how  to  apply  these   \n' +
      'products safely. These products were so effective that it looked at first  \n' +
      'as  though  all  pests  could  be  eradicated  in  this  way.  In  practice,  how- \n' +
      'ever,  the  pests  were  not  actually  eradicated,  because  they  came  back   \n' +
      'every growing season. Many natural enemies were temporarily wiped  \n' +
      'out along with the pests, which gave the pests the opportunity to mul- \n' +
      'tiply even more explosively than before (see case 1 for an example).  \n' +
      'To ensure a healthy crop, it was often necessary to spray several times  \n' +
      'per season just to control one type of pest. Eventually, some pesticides',
    metadata: {
      source: '/home/david/test1/gpt4-pdf-chatbot-langchain/docs/sample.pdf',
      pdf: [Object],
      loc: [Object]
    }
  }
]
creating vector store...
error PineconeConnectionError: Request failed to reach Pinecone. Verify you have the correct environment, project id, and index name configured.
    at /home/david/test1/gpt4-pdf-chatbot-langchain/node_modules/@pinecone-database/pinecone/dist/errors/handling.js:52:39
    at step (/home/david/test1/gpt4-pdf-chatbot-langchain/node_modules/@pinecone-database/pinecone/dist/errors/handling.js:33:23)
    at Object.next (/home/david/test1/gpt4-pdf-chatbot-langchain/node_modules/@pinecone-database/pinecone/dist/errors/handling.js:14:53)
    at /home/david/test1/gpt4-pdf-chatbot-langchain/node_modules/@pinecone-database/pinecone/dist/errors/handling.js:8:71
    at new Promise (<anonymous>)
    at __awaiter (/home/david/test1/gpt4-pdf-chatbot-langchain/node_modules/@pinecone-database/pinecone/dist/errors/handling.js:4:12)
    at handleFetchError (/home/david/test1/gpt4-pdf-chatbot-langchain/node_modules/@pinecone-database/pinecone/dist/errors/handling.js:47:67)
    at /home/david/test1/gpt4-pdf-chatbot-langchain/node_modules/@pinecone-database/pinecone/dist/errors/handling.js:62:54
    at step (/home/david/test1/gpt4-pdf-chatbot-langchain/node_modules/@pinecone-database/pinecone/dist/errors/handling.js:33:23)
    at Object.next (/home/david/test1/gpt4-pdf-chatbot-langchain/node_modules/@pinecone-database/pinecone/dist/errors/handling.js:14:53) {
  attemptNumber: 7,
  retriesLeft: 0
}

file:///home/david/test1/gpt4-pdf-chatbot-langchain/scripts/ingest-data.ts:47
    throw new Error('Failed to ingest your data');
          ^
Error: Failed to ingest your data
    at run (file:///home/david/test1/gpt4-pdf-chatbot-langchain/scripts/ingest-data.ts:47:11)
    at processTicksAndRejections (node:internal/process/task_queues:95:5)
    at file:///home/david/test1/gpt4-pdf-chatbot-langchain/scripts/ingest-data.ts:52:3
error Command failed with exit code 1.
info Visit https://yarnpkg.com/en/docs/cli/run for documentation about this command.

dosubot bot commented Jul 29, 2024

The PINECONE_ENVIRONMENT value should indeed be something like us-east1-gcp, but it seems you are still encountering issues. Here are a few steps to help you troubleshoot further:

  1. Verify Pinecone Environment:

    • The exact environment string can be found in the Pinecone dashboard under the "Indexes" tab. Look for the "Environment" field. If you cannot find it, it might be due to the limitations of the free account. In that case, you can try reaching out to Pinecone support for clarification.
  2. Check Document Structure:

    • The documents returned by textSplitter.splitDocuments(rawDocs) having a pageContent property instead of text is fine. You can adjust the textKey parameter in the PineconeStore.fromDocuments method to use pageContent instead of text.
  3. Update Code:

    • Modify the textKey parameter to match the property name in your documents (pageContent).

Here’s the updated code snippet:

import { RecursiveCharacterTextSplitter } from 'langchain/text_splitter';
import { OpenAIEmbeddings } from 'langchain/embeddings/openai';
import { PineconeStore } from 'langchain/vectorstores/pinecone';
import { pinecone } from '@/utils/pinecone-client';
import { PDFLoader } from 'langchain/document_loaders/fs/pdf';
import { PINECONE_INDEX_NAME, PINECONE_NAME_SPACE } from '@/config/pinecone';
import { DirectoryLoader } from 'langchain/document_loaders/fs/directory';

const filePath = 'docs';

export const run = async () => {
  try {
    const directoryLoader = new DirectoryLoader(filePath, {
      '.pdf': (path) => new PDFLoader(path),
    });

    const rawDocs = await directoryLoader.load();
    console.log('rawDocs', rawDocs); // Log rawDocs to inspect structure

    const textSplitter = new RecursiveCharacterTextSplitter({
      chunkSize: 1000,
      chunkOverlap: 200,
    });

    const docs = await textSplitter.splitDocuments(rawDocs);
    console.log('split docs', docs); // Log split docs to inspect structure

    console.log('creating vector store...');
    const embeddings = new OpenAIEmbeddings();
    const index = pinecone.Index(PINECONE_INDEX_NAME);

    await PineconeStore.fromDocuments(docs, embeddings, {
      pineconeIndex: index,
      namespace: PINECONE_NAME_SPACE,
      textKey: 'pageContent', // Use 'pageContent' instead of 'text'
    });
  } catch (error) {
    console.log('error', error);
    throw new Error('Failed to ingest your data');
  }
};

(async () => {
  await run();
  console.log('ingestion complete');
})();

  4. Verify Environment Variables:

    • Double-check that all environment variables are correctly set in your .env file:
      OPENAI_API_KEY=sk-proj-QPLoQyqRqnNtzklWHrqFT3BlbkFJKm...
      PINECONE_API_KEY=ae9ac4c6-fa87-49c2-bde0-21....
      PINECONE_ENVIRONMENT=us-east1-gcp
      PINECONE_INDEX_NAME=index1
  5. Pinecone Index Configuration:

    • Ensure that your Pinecone index is correctly configured with the required vector dimensions (e.g., 1536) and other necessary settings.
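
The dimension requirement can be checked locally before anything is sent to Pinecone. This is a rough sketch: `assertDimension` is a hypothetical helper, and the tiny 3-dimensional vectors stand in for real embeddings (the thread mentions 1536 as an example index dimension).

```typescript
// Hypothetical helper: throws if any vector's length differs from the
// dimension the index was created with.
function assertDimension(vectors: number[][], indexDimension: number): void {
  vectors.forEach((vector, i) => {
    if (vector.length !== indexDimension) {
      throw new Error(
        `vector ${i} has ${vector.length} dimensions, index expects ${indexDimension}`,
      );
    }
  });
}

// Toy vectors for illustration; real embeddings would come from the model.
const demo = [
  [0.1, 0.2, 0.3],
  [0.4, 0.5, 0.6],
];

assertDimension(demo, 3); // matches, so nothing is thrown

try {
  assertDimension(demo, 1536); // mismatch against a 1536-dimension index
} catch (err) {
  console.log((err as Error).message);
}
```

A mismatch of this kind is a separate problem from the connection error shown earlier in the thread, which occurs before any vectors are accepted.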

By following these steps, you should be able to resolve the issues and successfully ingest your data [1].

@dmotxilla
Author

I asked Pinecone about it:
https://community.pinecone.io/t/how-to-find-out-what-is-my-environment-name-in-a-free-pinecone-account/6253

Reply:
It looks like that code is using the older connection method which is not supported with Serverless indexes.

gpt4-pdf-chatbot-langchain needs to be updated to use the latest version of the NodeJS SDK.
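
To illustrate what that reply implies for the configuration, here is a minimal sketch, not the project's actual fix. It assumes version 2 or later of the Node.js SDK, where the client is constructed from an API key alone and no environment string is passed. `buildPineconeOptions` is a hypothetical helper, and the SDK calls appear only as comments so the block runs without the package installed.

```typescript
// Options for the newer client in this sketch: only an API key.
// Assumption: SDK v2 or later, where `environment` is no longer passed.
interface PineconeOptions {
  apiKey: string;
}

// Hypothetical helper: builds the client options from environment variables
// and ignores PINECONE_ENVIRONMENT entirely.
function buildPineconeOptions(
  env: Record<string, string | undefined>,
): PineconeOptions {
  const apiKey = env['PINECONE_API_KEY'];
  if (!apiKey) {
    throw new Error('PINECONE_API_KEY is not set');
  }
  return { apiKey };
}

// With the package installed, the options would be used roughly like this:
//   import { Pinecone } from '@pinecone-database/pinecone';
//   const pc = new Pinecone(buildPineconeOptions(process.env));
//   const index = pc.index('index1');

const options = buildPineconeOptions({
  PINECONE_API_KEY: 'example-key',
  PINECONE_ENVIRONMENT: 'us-east1-gcp', // present but unused
});
console.log(Object.keys(options));
```

Upgrading the SDK may also require matching changes in the LangChain integration, which this sketch does not cover.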

@dmotxilla
Author

I am closing this issue, as the original subject was solved, and I have created a new ticket for the current issue:
#473
