POST v1/Extract
POST
https://api.subworkflow.ai/v1/extract
Upload a file for extraction
Summary
- This endpoint only splits the file and generates dataset items and a filtering index.
- If you want to enable searching, use the v1/vectorize endpoint instead.
- If you want to vectorize an existing dataset already uploaded through v1/extract, you can use the v1/datasets/:id/vectorize endpoint to trigger the vectorize process.
- You must provide one of "file" or "url", but not both.
- For binary uploads: the total upload file size limit is determined by your subscription plan, but this endpoint also has a technical limit of 100mb. Use v1/upload-session for files larger than 100mb.
- For URL uploads: an alternative way to upload files, including files larger than 100mb, is by URL. When you provide a URL instead of the binary, Subworkflow fetches the file from the source location.
- The ability to upload also depends on your remaining data storage allocation. If the uploaded file, added to your current usage, exceeds the maximum for your subscription, the upload will fail. Delete existing datasets to free up capacity, or upgrade your subscription.
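The input rules above can be sketched as a small pre-flight check. This is a minimal illustration, not part of the Subworkflow API: `chooseUploadRoute`, its input shape, and its return labels are hypothetical names, and it assumes "100mb" means 100 × 1024 × 1024 bytes.

```typescript
// Hypothetical helper: decides how a file should be sent, based on the rules above.
// Assumption: the 100mb technical limit is interpreted as 100 * 1024 * 1024 bytes.
const MAX_BINARY_BYTES = 100 * 1024 * 1024;

type UploadInput = { fileSizeBytes?: number; url?: string };
type UploadRoute = "extract-binary" | "extract-url" | "upload-session";

function chooseUploadRoute(input: UploadInput): UploadRoute {
  const hasFile = input.fileSizeBytes !== undefined;
  const hasUrl = input.url !== undefined;

  // Exactly one of "file" or "url" must be provided.
  if (hasFile === hasUrl) {
    throw new Error('Provide exactly one of "file" or "url".');
  }
  if (input.url !== undefined) {
    return "extract-url"; // URL uploads are not bound by the binary limit
  }
  // Binary uploads over the limit should go through v1/upload-session instead.
  const size = input.fileSizeBytes ?? 0;
  return size > MAX_BINARY_BYTES ? "upload-session" : "extract-binary";
}
```

The check does not cover subscription or storage limits, which only the server can evaluate.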
Parameters
| name | type | location | required | description |
|---|---|---|---|---|
| content-type | string | header | required | Must be set to multipart/form-data. |
| file | blob | body | required* | The file binary you want to upload. Accepted file formats: pdf, docx, pptx, xlsx. Max 100mb. *Required if the "url" parameter is not provided. |
| url | string | body | required* | The URL of the file you want to upload. Must be publicly accessible for the duration of the upload. *Required if the "file" parameter is not provided. |
| expiresInDays | number | body | optional | Overrides the default expiration time of the resulting dataset, in days from its creation date. |
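As a rough sketch of how the body parameters in the table map onto a multipart request, the helper below builds a standard `FormData` object. `buildExtractForm` and `ExtractSource` are hypothetical names introduced for illustration; only the field names (`file`, `url`, `expiresInDays`) come from the table.

```typescript
// Hypothetical helper: builds the multipart body described in the parameters table.
// The union type makes it impossible to pass both "file" and "url".
type ExtractSource = { file: Blob; fileName: string } | { url: string };

function buildExtractForm(source: ExtractSource, expiresInDays?: number): FormData {
  const form = new FormData();
  if ("url" in source) {
    form.append("url", source.url);
  } else {
    form.append("file", source.file, source.fileName);
  }
  // Optional override of the dataset expiration, sent as a string form field.
  if (expiresInDays !== undefined) {
    form.append("expiresInDays", String(expiresInDays));
  }
  return form;
}
```

When a `FormData` body is passed to `fetch`, the multipart content type and its boundary are set automatically, so the request only needs the `x-api-key` header in addition.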
Response
Once the file is uploaded successfully, you'll receive a jobs response which displays the details of the job tracking the extract request. Take note of the following properties:
- id - you can use this with v1/jobs/:id to get an updated version of the job.
- datasetId - this is the dataset record created for the uploaded file. You'll need this to fetch the dataset and dataset items when the job is finished.
- status - this is the progress of the job. "SUCCESS" and "ERROR" are the finished states you should check for.
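One way to act on these properties is a polling loop that stops at either finished state. The sketch below is illustrative only: `waitForJob`, `JobSnapshot`, the polling interval, and the attempt cap are assumptions, and the function that actually fetches the job is passed in so the loop does not depend on any particular HTTP client.

```typescript
type JobStatus = "NOT_STARTED" | "IN_PROGRESS" | "SUCCESS" | "ERROR";

// Only the fields the loop needs; the full job object has more properties.
interface JobSnapshot {
  id: string;
  datasetId: string;
  status: JobStatus;
  statusText: string | null;
}

// Hypothetical helper: calls fetchJob until the job reaches "SUCCESS" or "ERROR".
// intervalMs and maxAttempts are arbitrary defaults, not documented values.
async function waitForJob(
  jobId: string,
  fetchJob: (id: string) => Promise<JobSnapshot>,
  intervalMs = 2000,
  maxAttempts = 60,
): Promise<JobSnapshot> {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const job = await fetchJob(jobId);
    if (job.status === "SUCCESS" || job.status === "ERROR") {
      return job; // finished state: the caller decides how to handle "ERROR"
    }
    await new Promise<void>((resolve) => setTimeout(resolve, intervalMs));
  }
  throw new Error(`Job ${jobId} did not finish after ${maxAttempts} attempts`);
}
```

In practice `fetchJob` would wrap a request to v1/jobs/:id and return the job object from the response; on "SUCCESS", the returned `datasetId` is what you use to fetch the dataset and its items.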
- Success
- 400 Error
- 404 Error
{
"type": "object",
"properties": {
"success": { "type": "boolean" },
"total": { "type": "number" },
"data": {
"type": "object",
"properties": {
"id": {
"type": "string"
},
"datasetId": {
"type": "string"
},
"type": {
"type": "string",
"enum": ["datasets/extract","datasets/vectorize"]
},
"status": {
"type": "string",
"enum": ["NOT_STARTED","IN_PROGRESS","SUCCESS","ERROR"]
},
"statusText": {
"type": "string"
},
"startedAt": {
"type": "number"
},
"finishedAt": {
"type": "number"
},
"canceledAt": {
"type": "number"
},
"createdAt": {
"type": "number"
},
"updatedAt": {
"type": "number"
}
}
}
}
}
{
"success": { "type": "boolean" },
"error": { "type": "string" }
}
{
"success": { "type": "boolean" },
"error": { "type": "string" }
}
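For typed clients, the three response shapes above can be mirrored as a discriminated union. This is a sketch under one assumption that the schemas do not state explicitly: that `success` is `true` for the success shape and `false` for the 400 and 404 shapes. The names `ExtractJob`, `ExtractResponse`, and `unwrapJob` are hypothetical.

```typescript
// Mirrors the "data" object of the success schema. Timestamps are typed as
// nullable because the example response shows null for unset values.
interface ExtractJob {
  id: string;
  datasetId: string;
  type: "datasets/extract" | "datasets/vectorize";
  status: "NOT_STARTED" | "IN_PROGRESS" | "SUCCESS" | "ERROR";
  statusText: string | null;
  startedAt: number | null;
  finishedAt: number | null;
  canceledAt: number | null;
  createdAt: number;
  updatedAt: number;
}

// Assumption: "success" discriminates between the success and error shapes.
type ExtractResponse =
  | { success: true; total: number; data: ExtractJob }
  | { success: false; error: string };

// Hypothetical helper: returns the job or throws with the server's error message.
function unwrapJob(body: ExtractResponse): ExtractJob {
  if (!body.success) {
    throw new Error(`Extract request failed: ${body.error}`);
  }
  return body.data;
}
```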
Example
- Curl
- JS/TS
curl -X POST https://api.subworkflow.ai/v1/extract \
  --header 'x-api-key: <YOUR-API-KEY>' \
  --header 'Content-Type: multipart/form-data' \
  --form "file=@/path/to/file.pdf"

# or example using a url
curl -X POST https://api.subworkflow.ai/v1/extract \
  --header 'x-api-key: <YOUR-API-KEY>' \
  --header 'Content-Type: multipart/form-data' \
  --form "url=https://arxiv.org/pdf/1706.03762"

Note: returns a Job object. You will need to poll https://api.subworkflow.ai/v1/jobs/<jobId> for updates.
{
"success": true,
"total": 1,
"data": {
"id": "dsj_5fwR7qoMXracJQaf",
"datasetId": "ds_VV08ECeQBQgDoVn6",
"type": "datasets/extract",
"status": "IN_PROGRESS",
"statusText": null,
"startedAt": null,
"finishedAt": null,
"canceledAt": null,
"createdAt": 1761910647113,
"updatedAt": 1761910647113
}
}
import * as fs from "node:fs";
import { Subworkflow } from "@subworkflow/sdk";

const subworkflow = new Subworkflow({ apiKey: "<YOUR-API-KEY>" });

// Upload a local file as a binary
const fileInput = fs.readFileSync("/path/to/file.pdf");
const dataset = await subworkflow.extract(fileInput, { fileName: "file.pdf" });

// or example using a url
const datasetFromUrl = await subworkflow.extract(new URL("https://arxiv.org/pdf/1706.03762"), { fileName: "file.pdf" });

Note: returns a Dataset object. The SDK handles polling automatically by default.
{
"id": "ds_VV08ECeQBQgDoVn6",
"workspaceId": "wks_Gg9Bzi7sx8fbCfWI",
"type": "doc",
"itemCount": 1,
"fileName": "file_AIpNsoTx4OkRNY3H",
"fileExt": "pdf",
"mimeType": "application/pdf",
"fileSize": 136056,
"createdAt": 1761910646651,
"updatedAt": 1761910646651,
"expiresAt": 1761910646651,
"share": {
"url": "https://api.subworkflow.ai/v1/share/dsx_DdTXOgxPh0PLSPhb?token=VkVBNh",
"token": "VkVBNh",
"expiresAt": 1761910891643
}
}