
POST v1/Vectorize

POST
https://api.subworkflow.ai/v1/vectorize

Upload a file for extraction and vectorization

Summary

  • Splits the file and generates dataset items and a filtering index.
  • Generates image embeddings for all dataset items and populates a workspace-scoped vector store.
  • Vectorization typically takes longer to process; if you only need the filtering index, use the v1/extract endpoint instead.
  • To vectorize an existing dataset already uploaded through v1/extract, use the v1/datasets/:id/vectorize endpoint to trigger the vectorize process (see the sketch after this list).
  • You must provide either "file" or "url", but not both.
  • Binary uploads: the total upload file size limit is determined by your subscription plan, but this endpoint also has a technical limit of 100 MB. Use the v1/upload-session endpoint for files larger than 100 MB.
  • URL uploads: an alternative way to upload files, including files larger than 100 MB, is by URL. When you provide a URL instead of the binary, Subworkflow fetches the file from the source location.
  • Uploads also depend on your remaining data storage allocation. If the uploaded file, added to your current usage, would exceed the maximum for your subscription, the upload will fail. Delete existing datasets to free up capacity or upgrade your subscription.
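For example, a minimal sketch of triggering vectorization for an existing dataset. This assumes v1/datasets/:id/vectorize is called with POST and no request body, which this page does not spell out; <datasetId> is a placeholder:

curl -X POST https://api.subworkflow.ai/v1/datasets/<datasetId>/vectorize \
  --header 'x-api-key: <YOUR-API-KEY>'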

Parameters

  • content-type (string, header, required): You must set the request content type to multipart/form-data.
  • file (blob, body, required*): The file binary you want to upload. Accepted file formats: pdf, docx, pptx, xlsx. Max 100 MB. *Required if the "url" parameter is not provided.
  • url (string, body, required*): The URL of the file you want to upload. Must be publicly accessible for the duration of the upload. *Required if the "file" parameter is not provided.
  • expiresInDays (number, body, optional): Overrides the default expiration time for the resulting dataset, in days from the created date.
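For instance, a sketch of passing the optional expiresInDays as an extra multipart form field alongside the file; the value 30 is an arbitrary example:

curl -X POST https://api.subworkflow.ai/v1/vectorize \
  --header 'x-api-key: <YOUR-API-KEY>' \
  --header 'Content-Type: multipart/form-data' \
  --form "file=@/path/to/file" \
  --form "expiresInDays=30"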

Response

Once the file is uploaded successfully, you'll receive a job response with the details of the job tracking the vectorize request. Take note of the following properties:

  • id - use this with the v1/jobs/:id endpoint to get an updated version of the job
  • datasetId - the dataset record created for the uploaded file. You'll need this to fetch the dataset and dataset items when the job is finished.
  • status - the progress of the job. "SUCCESS" and "ERROR" are the finished states you should check for (see the polling sketch after the schema).
{
  "type": "object",
  "properties": {
    "success": { "type": "boolean" },
    "total": { "type": "number" },
    "data": {
      "type": "object",
      "properties": {
        "id": { "type": "string" },
        "datasetId": { "type": "string" },
        "type": {
          "type": "string",
          "enum": ["datasets/extract", "datasets/vectorize"]
        },
        "status": {
          "type": "string",
          "enum": ["NOT_STARTED", "IN_PROGRESS", "SUCCESS", "ERROR"]
        },
        "statusText": { "type": "string" },
        "startedAt": { "type": "number" },
        "finishedAt": { "type": "number" },
        "canceledAt": { "type": "number" },
        "createdAt": { "type": "number" },
        "updatedAt": { "type": "number" }
      }
    }
  }
}
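As a rough polling sketch, the shell loop below re-fetches the job until it reaches a finished state. It assumes jq is installed and that v1/jobs/:id returns the same { success, total, data } envelope shown above:

jobId="<jobId>"
while true; do
  status=$(curl -s https://api.subworkflow.ai/v1/jobs/$jobId \
    --header 'x-api-key: <YOUR-API-KEY>' | jq -r '.data.status')
  echo "status: $status"
  if [ "$status" = "SUCCESS" ] || [ "$status" = "ERROR" ]; then
    break
  fi
  # wait before the next poll to avoid hammering the API
  sleep 5
done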

Example

curl -X POST https://api.subworkflow.ai/v1/vectorize \
  --header 'x-api-key: <YOUR-API-KEY>' \
  --header 'Content-Type: multipart/form-data' \
  --form "file=@/path/to/file"

# or, using a url

curl -X POST https://api.subworkflow.ai/v1/vectorize \
  --header 'x-api-key: <YOUR-API-KEY>' \
  --header 'Content-Type: multipart/form-data' \
  --form "url=https://arxiv.org/pdf/1706.03762"

Note: this returns a Job object. You will need to poll https://api.subworkflow.ai/v1/jobs/<jobId> for updates.

{
  "success": true,
  "total": 1,
  "data": {
    "id": "dsj_5fwR7qoMXracJQaf",
    "datasetId": "ds_VV08ECeQBQgDoVn6",
    "type": "datasets/vectorize", // note: you may see `datasets/extract` initially
    "status": "IN_PROGRESS",
    "statusText": null,
    "startedAt": null,
    "finishedAt": null,
    "canceledAt": null,
    "createdAt": 1761910647113,
    "updatedAt": 1761910647113
  }
}
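If you capture the upload response in a variable, you can pull out the ids needed for follow-up calls. A small sketch assuming jq; the dataset retrieval endpoints themselves are not documented on this page:

response=$(curl -s -X POST https://api.subworkflow.ai/v1/vectorize \
  --header 'x-api-key: <YOUR-API-KEY>' \
  --header 'Content-Type: multipart/form-data' \
  --form "file=@/path/to/file")

# job id for polling v1/jobs/:id, dataset id for fetching the dataset later
jobId=$(echo "$response" | jq -r '.data.id')
datasetId=$(echo "$response" | jq -r '.data.datasetId')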