
POST v1/Extract

POST
https://api.subworkflow.ai/v1/extract

Upload a file for extraction

Summary

  • This endpoint only splits the file and generates the dataset items and the filtering index.
  • To enable searching, use the v1/vectorize endpoint instead.
  • To vectorize a dataset that was already uploaded through v1/extract, use the v1/datasets/:id/vectorize endpoint to trigger the vectorize process.
  • The total upload file size limit is determined by your subscription plan, but this endpoint also has a technical limit of 100mb. Use v1/upload-session for files larger than 100mb.
  • Uploading also depends on your remaining data storage allocation. If the uploaded file plus your current usage exceeds the maximum for your subscription, the upload will fail. Delete existing datasets to free up capacity, or upgrade your subscription.
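
The size rule above can be checked on the client before uploading. The following is a minimal Python sketch, not part of the official API: the function name `choose_upload_endpoint` is hypothetical, and reading "100mb" as 100 * 1024 * 1024 bytes is an assumption, since the page does not say whether the limit is decimal or binary.

```python
# Assumption: "100mb" means 100 * 1024 * 1024 bytes. The docs do not state
# whether the limit is measured in decimal or binary megabytes.
EXTRACT_LIMIT_BYTES = 100 * 1024 * 1024


def choose_upload_endpoint(size_bytes: int) -> str:
    """Return the endpoint path suited to a file of the given size.

    Hypothetical helper: files over the technical limit go to
    v1/upload-session, everything else to v1/extract.
    """
    if size_bytes < 0:
        raise ValueError("size_bytes must be non-negative")
    if size_bytes > EXTRACT_LIMIT_BYTES:
        return "v1/upload-session"
    return "v1/extract"
```

In practice the size could come from `os.path.getsize(path)`. The subscription and storage limits are not covered here, because the page does not describe how to query them.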

Parameters

| name | type | location | required | description |
| --- | --- | --- | --- | --- |
| content-type | string | header | required | You must set the request content type to multipart/form-data. |
| file | file | body | required | The file you want to upload. Accepted file formats: pdf, docx, pptx, xlsx. Max 100mb. |
| expiresInDays | number | body | optional | Overrides the default expiration time for the resulting dataset, in days from the created date. |
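
As an illustration of how these parameters fit together, here is a hedged Python sketch that builds (but does not send) the multipart/form-data request using only the standard library. The helper name `build_extract_request` is hypothetical, and checking the file extension on the client is an assumption about how one might enforce the accepted formats; the server remains the authority.

```python
import mimetypes
import urllib.request
import uuid
from pathlib import Path
from typing import Optional

EXTRACT_URL = "https://api.subworkflow.ai/v1/extract"
ACCEPTED_EXTENSIONS = {".pdf", ".docx", ".pptx", ".xlsx"}


def build_extract_request(filename: str, content: bytes, api_key: str,
                          expires_in_days: Optional[int] = None) -> urllib.request.Request:
    """Build a POST request carrying `file` and, optionally, `expiresInDays`."""
    name = Path(filename).name
    if Path(name).suffix.lower() not in ACCEPTED_EXTENSIONS:
        raise ValueError(f"unsupported file format: {name}")

    # A random boundary separates the form parts in the request body.
    boundary = uuid.uuid4().hex
    file_type = mimetypes.guess_type(name)[0] or "application/octet-stream"

    body = b""
    if expires_in_days is not None:
        header = (
            f"--{boundary}\r\n"
            'Content-Disposition: form-data; name="expiresInDays"\r\n\r\n'
        )
        body += header.encode() + str(expires_in_days).encode() + b"\r\n"

    header = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{name}"\r\n'
        f"Content-Type: {file_type}\r\n\r\n"
    )
    body += header.encode() + content + b"\r\n"
    body += f"--{boundary}--\r\n".encode()

    return urllib.request.Request(
        EXTRACT_URL,
        data=body,
        method="POST",
        headers={
            "x-api-key": api_key,
            "Content-Type": f"multipart/form-data; boundary={boundary}",
        },
    )
```

Passing the returned object to `urllib.request.urlopen` would send it. An HTTP client with built-in multipart support would make the manual body construction unnecessary.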

Response

Once the file is uploaded successfully, you'll receive a jobs response which displays the details of the job tracking the extract request. Take note of the following properties:

  • id - use this with the v1/jobs/:id endpoint to get an updated version of the job.
  • datasetId - the dataset record created for the uploaded file. You'll need this to fetch the dataset and dataset items when the job is finished.
  • status - the progress of the job. "SUCCESS" and "ERROR" are the finished states you should check for.
{
  "type": "object",
  "properties": {
    "success": { "type": "boolean" },
    "total": { "type": "number" },
    "data": {
      "type": "object",
      "properties": {
        "id": {
          "type": "string"
        },
        "datasetId": {
          "type": "string"
        },
        "type": {
          "type": "string",
          "enum": ["datasets/extract","datasets/vectorize"]
        },
        "status": {
          "type": "string",
          "enum": ["NOT_STARTED","IN_PROGRESS","SUCCESS","ERROR"]
        },
        "statusText": {
          "type": "string"
        },
        "startedAt": {
          "type": "number"
        },
        "finishedAt": {
          "type": "number"
        },
        "canceledAt": {
          "type": "number"
        },
        "createdAt": {
          "type": "number"
        },
        "updatedAt": {
          "type": "number"
        }
      }
    }
  }
}
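
Since the job finishes asynchronously, a client typically polls v1/jobs/:id until the status reaches "SUCCESS" or "ERROR". The sketch below shows one way to do that in Python. The names `wait_for_job` and `fetch_job_http`, the polling interval, and the attempt cap are all hypothetical. It also assumes that v1/jobs/:id returns the same envelope as above (job under `data`) and accepts the same x-api-key header, which this page does not confirm.

```python
import json
import time
import urllib.request
from typing import Any, Callable, Dict

JOBS_URL = "https://api.subworkflow.ai/v1/jobs"
FINISHED_STATES = {"SUCCESS", "ERROR"}


def fetch_job_http(job_id: str, api_key: str) -> Dict[str, Any]:
    """GET the job. Assumes the x-api-key header also applies to this endpoint."""
    req = urllib.request.Request(f"{JOBS_URL}/{job_id}", headers={"x-api-key": api_key})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def wait_for_job(fetch_job: Callable[[str], Dict[str, Any]], job_id: str,
                 interval_seconds: float = 2.0, max_attempts: int = 30,
                 sleep: Callable[[float], None] = time.sleep) -> Dict[str, Any]:
    """Poll until the job reaches a finished state, then return the job object."""
    for attempt in range(max_attempts):
        # Assumption: the job sits under "data", as in the extract response.
        job = fetch_job(job_id)["data"]
        if job["status"] in FINISHED_STATES:
            return job
        if attempt < max_attempts - 1:
            sleep(interval_seconds)
    raise TimeoutError(f"job {job_id} did not finish after {max_attempts} attempts")
```

A call might look like `wait_for_job(lambda jid: fetch_job_http(jid, api_key), job_id)`. Taking the fetch function as an argument keeps the loop testable without network access.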

Example

curl -X POST https://api.subworkflow.ai/v1/extract \
  --header 'x-api-key: <YOUR-API-KEY>' \
  --header 'Content-Type: multipart/form-data' \
  --form "file=@/path/to/file"
// note: you can poll https://api.subworkflow.ai/v1/jobs/<jobId> for updates
{
  "success": true,
  "total": 1,
  "data": {
    "id": "dsj_5fwR7qoMXracJQaf",
    "datasetId": "ds_VV08ECeQBQgDoVn6",
    "type": "datasets/extract",
    "status": "IN_PROGRESS",
    "statusText": null,
    "startedAt": null,
    "finishedAt": null,
    "canceledAt": null,
    "createdAt": 1761910647113,
    "updatedAt": 1761910647113
  }
}
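
To show how a client might consume a response like the one above, here is a small Python sketch that parses it into a typed object. The `Job` class and `parse_job_response` function are hypothetical names. Treating the timestamps as milliseconds since the Unix epoch is an assumption based on the magnitude of the example values; the page does not state the unit.

```python
import json
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


def _to_datetime(ms: Optional[float]) -> Optional[datetime]:
    # Assumption: timestamps are epoch milliseconds; null means "not set yet".
    if ms is None:
        return None
    return datetime.fromtimestamp(ms / 1000, tz=timezone.utc)


@dataclass
class Job:
    id: str
    dataset_id: str
    type: str
    status: str
    status_text: Optional[str]
    started_at: Optional[datetime]
    finished_at: Optional[datetime]
    created_at: Optional[datetime]

    @property
    def is_finished(self) -> bool:
        # "SUCCESS" and "ERROR" are the finished states named in the docs.
        return self.status in ("SUCCESS", "ERROR")


def parse_job_response(raw: str) -> Job:
    """Parse the JSON text of a jobs response into a Job."""
    payload = json.loads(raw)
    if not payload.get("success"):
        raise ValueError("response does not report success")
    data = payload["data"]
    return Job(
        id=data["id"],
        dataset_id=data["datasetId"],
        type=data["type"],
        status=data["status"],
        status_text=data.get("statusText"),
        started_at=_to_datetime(data.get("startedAt")),
        finished_at=_to_datetime(data.get("finishedAt")),
        created_at=_to_datetime(data.get("createdAt")),
    )
```

Note that the example response carries null for several fields the schema types as string or number, so the sketch treats those fields as optional.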