POST v1/Extract

POST
https://api.subworkflow.ai/v1/extract

Upload a file for extraction

Summary

This endpoint only splits the file and generates dataset items and a filtering index.
To enable searching, use the v1/vectorize endpoint instead.
To vectorize an existing dataset already uploaded through v1/extract, use the v1/datasets/:id/vectorize endpoint to trigger the vectorize process.
You must provide either "file" or "url", but not both.
For binary uploads: the total upload file size limit is determined by your subscription plan, but this endpoint also has a technical limit of 100 MB. Use v1/upload-session for files larger than 100 MB.
For URL uploads: an alternative way to upload files, including files larger than 100 MB, is by URL. When you provide a URL instead of the binary, Subworkflow fetches the file from the source location.
The ability to upload also depends on your remaining data storage allocation. If the uploaded file, added to your current usage, exceeds the maximum for your subscription, the upload fails. Delete existing datasets to free up capacity, or upgrade your subscription.
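
The size and storage rules above can be checked on the client before any request is sent. The following is a minimal sketch, not part of the Subworkflow SDK: the names `chooseUploadRoute`, `fitsStorage`, and the route labels are hypothetical, and it assumes the 100 MB limit means 100 * 1024 * 1024 bytes, which the documentation does not specify. It also assumes the file is available locally whenever it is small enough for a binary upload.

```typescript
// Assumption: the 100 MB limit is treated as 100 * 1024 * 1024 bytes.
// The docs do not say whether the limit is decimal or binary megabytes.
const DIRECT_UPLOAD_LIMIT_BYTES = 100 * 1024 * 1024;

// Hypothetical labels for the three upload paths described above.
type UploadRoute = "extract-file" | "extract-url" | "upload-session";

interface UploadCandidate {
  sizeBytes: number;
  publicUrl?: string; // set only if the file is publicly reachable
}

// Pick a route: binary upload when the file fits, otherwise URL upload
// if a public URL exists, otherwise the upload-session endpoint.
function chooseUploadRoute(candidate: UploadCandidate): UploadRoute {
  if (candidate.sizeBytes <= DIRECT_UPLOAD_LIMIT_BYTES) return "extract-file";
  return candidate.publicUrl !== undefined ? "extract-url" : "upload-session";
}

// True when the new file still fits within the plan's storage allocation.
function fitsStorage(currentUsageBytes: number, fileBytes: number, planMaxBytes: number): boolean {
  return currentUsageBytes + fileBytes <= planMaxBytes;
}
```

A caller might run `fitsStorage` first and only then use `chooseUploadRoute` to decide which endpoint to call. The server remains the source of truth for both limits.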

Parameters

name	type	location	required	description
content-type	string	header	required	You must set the request content type to `multipart/form-data`
file	blob	body	Required*	The file binary to upload. Accepted file formats: `pdf`, `docx`, `pptx`, `xlsx`. Max 100 MB. *Required if the "url" parameter is not provided.
url	string	body	Required*	The URL of the file to upload. Must be publicly accessible for the duration of the upload. *Required if the "file" parameter is not provided.
expiresInDays	number	body	optional	Overrides the default expiration time for the resulting dataset, in days from its creation date.
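
As an illustration of the parameter rules in the table, the sketch below builds a multipart body that contains exactly one of "file" or "url". The helpers `buildExtractForm` and `postExtract` and the `fileName` field are hypothetical and not part of any official client. The `x-api-key` header is taken from the curl example on this page, and the code assumes a runtime with global `FormData`, `Blob`, and `fetch` (for example Node 18+ or a browser).

```typescript
interface ExtractInput {
  file?: Blob;
  fileName?: string; // hypothetical: name attached to the multipart file part
  url?: string;
  expiresInDays?: number;
}

// Build the multipart body, enforcing "one of file or url, but not both".
function buildExtractForm(input: ExtractInput): FormData {
  const hasFile = input.file !== undefined;
  const hasUrl = input.url !== undefined;
  if (hasFile === hasUrl) {
    throw new Error('Provide exactly one of "file" or "url".');
  }
  const form = new FormData();
  if (hasFile) form.append("file", input.file as Blob, input.fileName ?? "upload");
  if (hasUrl) form.append("url", input.url as string);
  if (input.expiresInDays !== undefined) {
    form.append("expiresInDays", String(input.expiresInDays));
  }
  return form;
}

// Send the form. fetch derives the multipart/form-data content type and its
// boundary from the FormData body, so the header is not set by hand here.
async function postExtract(apiKey: string, form: FormData): Promise<unknown> {
  const res = await fetch("https://api.subworkflow.ai/v1/extract", {
    method: "POST",
    headers: { "x-api-key": apiKey },
    body: form,
  });
  return res.json();
}
```

Validating on the client only saves a round trip. The API still applies its own checks.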

Response

Once the file is uploaded successfully, you'll receive a job response with the details of the job tracking the extract request. Take note of the following properties:

id - use this with v1/jobs/:id to get the latest state of the job.
datasetId - the dataset record created for the uploaded file. You'll need this to fetch the dataset and dataset items when the job is finished.
status - the progress of the job. "SUCCESS" and "ERROR" are the finished states you should check for.
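
The job properties above suggest a simple polling loop: fetch the job by id until its status is "SUCCESS" or "ERROR". The following is a minimal sketch rather than the SDK's implementation. `waitForJob`, `fetchJobOverHttp`, the polling interval, and the attempt cap are all hypothetical choices, and the HTTP fetcher assumes that v1/jobs/:id is read with GET and returns the same `{ success, total, data }` envelope as this endpoint.

```typescript
type JobStatus = "NOT_STARTED" | "IN_PROGRESS" | "SUCCESS" | "ERROR";

interface Job {
  id: string;
  datasetId: string;
  status: JobStatus;
  statusText: string | null;
}

// The fetcher is injected so the loop can be exercised without a network.
type FetchJob = (jobId: string) => Promise<Job>;

// Poll until the job reaches a finished state, or give up after maxAttempts.
async function waitForJob(
  jobId: string,
  fetchJob: FetchJob,
  intervalMs = 2000,
  maxAttempts = 60,
): Promise<Job> {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const job = await fetchJob(jobId);
    if (job.status === "SUCCESS" || job.status === "ERROR") return job;
    await new Promise<void>((resolve) => setTimeout(resolve, intervalMs));
  }
  throw new Error(`Job ${jobId} did not finish after ${maxAttempts} polls`);
}

// Assumption: GET v1/jobs/:id returns { success, total, data } with the job in data.
function fetchJobOverHttp(apiKey: string): FetchJob {
  return async (jobId) => {
    const res = await fetch(`https://api.subworkflow.ai/v1/jobs/${jobId}`, {
      headers: { "x-api-key": apiKey },
    });
    if (!res.ok) throw new Error(`Job lookup failed with HTTP ${res.status}`);
    const body = (await res.json()) as { data: Job };
    return body.data;
  };
}
```

A call such as `waitForJob(job.id, fetchJobOverHttp("<YOUR-API-KEY>"))` would resolve with the finished job, after which `datasetId` can be used to fetch the dataset.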

The Success, 400 Error, and 404 Error response schemas follow, in that order.

{
    "type": "object",
    "properties": {
        "success": { "type": "boolean" },
        "total": { "type": "number" },
        "data": {
            "type": "object",
            "properties": {
                "id": {
                    "type": "string"
                },
                "datasetId": {
                    "type": "string"
                },
                "type": {
                    "type": "string",
                    "enum": ["datasets/extract","datasets/vectorize"]
                },
                "status": {
                    "type": "string",
                    "enum": ["NOT_STARTED","IN_PROGRESS","SUCCESS","ERROR"]
                },
                "statusText": {
                    "type": "string"
                },
                "startedAt": {
                    "type": "number"
                },
                "finishedAt": {
                    "type": "number"
                },
                "canceledAt": {
                    "type": "number"
                },
                "createdAt": {
                    "type": "number"
                },
                "updatedAt": {
                    "type": "number"
                }
            }
        }
    }
}

{
    "type": "object",
    "properties": {
        "success": { "type": "boolean" },
        "error": { "type": "string" }
    }
}

{
    "type": "object",
    "properties": {
        "success": { "type": "boolean" },
        "error": { "type": "string" }
    }
}
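
One way to consume these schemas is to model the success and error shapes as a single union and narrow on the "success" flag. This is a sketch under an assumption the schemas do not state outright: that error responses carry `success: false` and successful ones `success: true`. The type and function names are hypothetical.

```typescript
interface ExtractJob {
  id: string;
  datasetId: string;
  type: "datasets/extract" | "datasets/vectorize";
  status: "NOT_STARTED" | "IN_PROGRESS" | "SUCCESS" | "ERROR";
}

// Assumption: the "success" flag distinguishes the two response shapes.
type ExtractResponse =
  | { success: true; total: number; data: ExtractJob }
  | { success: false; error: string };

// Return the job on success, or throw with the server's error message.
function unwrapExtractResponse(body: ExtractResponse): ExtractJob {
  if (body.success) return body.data;
  throw new Error(`Extract request failed: ${body.error}`);
}
```

Throwing on the error shape keeps calling code on the happy path. Returning the union unchanged would work equally well where callers prefer to branch explicitly.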

Example

Curl

curl -X POST https://api.subworkflow.ai/v1/extract \
    --header 'x-api-key: <YOUR-API-KEY>' \
    --header 'Content-Type: multipart/form-data' \
    --form "file=@/path/to/file.pdf"

# Or, using a URL

curl -X POST https://api.subworkflow.ai/v1/extract \
    --header 'x-api-key: <YOUR-API-KEY>' \
    --header 'Content-Type: multipart/form-data' \
    --form "url=https://arxiv.org/pdf/1706.03762"

Note: Returns a Job object. Poll https://api.subworkflow.ai/v1/jobs/<jobId> for updates.

{
    "success": true,
    "total": 1,
    "data": {
        "id": "dsj_5fwR7qoMXracJQaf",
        "datasetId": "ds_VV08ECeQBQgDoVn6",
        "type": "datasets/extract",
        "status": "IN_PROGRESS",
        "statusText": null,
        "startedAt": null,
        "finishedAt": null,
        "canceledAt": null,
        "createdAt": 1761910647113,
        "updatedAt": 1761910647113
    }
}

JS/TS

import fs from "node:fs";
import { Subworkflow } from "@subworkflow/sdk";

const subworkflow = new Subworkflow({ apiKey: "<YOUR-API-KEY>" });

// Upload a local file as a binary
const fileInput = fs.readFileSync("/path/to/file.pdf");
const dataset = await subworkflow.extract(fileInput, { fileName: "file.pdf" });

// Or, using a URL
const datasetFromUrl = await subworkflow.extract(new URL("https://arxiv.org/pdf/1706.03762"), { fileName: "file.pdf" });

Note: Returns a Dataset object. The SDK handles polling automatically by default.

{
    "id": "ds_VV08ECeQBQgDoVn6",
    "workspaceId": "wks_Gg9Bzi7sx8fbCfWI",
    "type": "doc",
    "itemCount": 1,
    "fileName": "file_AIpNsoTx4OkRNY3H",
    "fileExt": "pdf",
    "mimeType": "application/pdf",
    "fileSize": 136056,
    "createdAt": 1761910646651,
    "updatedAt": 1761910646651,
    "expiresAt": 1761910646651,
    "share": {
        "url": "https://api.subworkflow.ai/v1/share/dsx_DdTXOgxPh0PLSPhb?token=VkVBNh",
        "token": "VkVBNh",
        "expiresAt": 1761910891643
    }
}
