Getting Started
Subworkflow.ai is an API Service and our core product is our Datasets API. The Datasets API is a simple REST-based API for file processing which is designed to be used within and as part of your own applications. The following is a quick start guide on the intended workflow. You can read API references later to discover advanced configurations to suit your use-case.
Before you begin...
Subworkflow.ai can be a powerful utility for AI projects but to get the most out of the service, we recommend the following:
- Familiarity with working with REST APIs - Currently, the REST API is the only way to use the Datasets API service; our SDKs are in the works! Being a REST API does, however, mean we're compatible with virtually any language or platform, and we'll include some platform examples later in the docs.
- An actual need to handle large documents - Projects dealing with larger document workloads will see the most benefit from the Subworkflow.ai service. If you typically handle documents of fewer than 50 pages or do not use VLMs, the file processing gains might seem negligible, but you might still benefit from the retrieval and search indexing features.
1. Sign up and Register your Organisation
First, sign up to register an account and your organisation (if not invited to one). An organisation is where you'll manage your workspaces, datasets and team as well as access controls and billing.
2. Subscribe to a Starter Plan Free Trial
Unfortunately, we do not currently have a free tier, but we offer a 14-day free trial of our Starter plan. With the Starter plan, customers can process documents of up to 50 MB, which is more than enough to handle documents of ~1000 pages (dependent on content and compression). Subscriptions are seat-based, and upgrading to a higher tier raises the file size upload limit, storage capacity and number of concurrent jobs.
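As a rough illustration of the upload limit, a client could check a file's size before sending it. This is only a sketch: the helper name `fitsUploadLimit` is hypothetical, and reading "50 MB" as 50 × 1024 × 1024 bytes is an assumption, since the service may measure megabytes differently.

```typescript
// Assumption: the Starter plan's 50 MB limit is interpreted as 50 MiB.
// Check the API reference for how the service actually measures file size.
const STARTER_MAX_UPLOAD_BYTES = 50 * 1024 * 1024;

// Hypothetical helper: returns true when a file of `sizeInBytes`
// fits within `limitInBytes`.
function fitsUploadLimit(
  sizeInBytes: number,
  limitInBytes: number = STARTER_MAX_UPLOAD_BYTES,
): boolean {
  return (
    Number.isFinite(sizeInBytes) &&
    sizeInBytes >= 0 &&
    sizeInBytes <= limitInBytes
  );
}
```

In a browser, the size would typically come from the `size` property of the selected `File`.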
3. Create a Workspace and generate API Key
Workspaces help organise and scope access to documents (or rather Datasets) for your team, clients or projects. API Keys are also workspace-scoped meaning they are only valid for the workspace they are generated for. In the Workspace > Settings page, create a new API key and copy the value when available.
4. Upload your first Dataset via the API
Next, we're going to transform a document into a Subworkflow Dataset by uploading it to our /extract API endpoint. With this one action, we can achieve quite a few things:
- Automatic conversion to PDF if applicable - supported document formats include docx, pptx and xlsx.
- Split the document into individual pages - pages are in PDF format
- Generate an image copy of each page - page copies are in JPG format for VLMs
- Compile a retrieval index from pages - making it possible to fetch any page or a range of pages quickly and efficiently
Once the upload succeeds, you receive a job response because the /extract endpoint is asynchronous. You'll need the job id to check when the job has completed successfully before you can start retrieving the dataset.
Curl
curl -X POST https://api.subworkflow.ai/v1/extract \
  --header 'x-api-key: <YOUR-API-KEY>' \
  --header 'Content-Type: multipart/form-data' \
  --form "file=@/path/to/file"
JS/TS
// fileInput is a File or Blob, e.g. taken from a file input element
const formdata = new FormData();
formdata.append("file", fileInput);
const req = await fetch("https://api.subworkflow.ai/v1/extract", {
  method: "POST",
  headers: {
    // Content-Type is left unset so fetch can add the multipart boundary itself
    "x-api-key": "<YOUR-API-KEY>"
  },
  body: formdata,
});
// note: you can poll https://api.subworkflow.ai/v1/jobs/<jobId> for updates
{
  "success": true,
  "total": 1,
  "data": {
    "id": "dsj_5fwR7qoMXracJQaf",
    "datasetId": "ds_VV08ECeQBQgDoVn6",
    "type": "datasets/extract",
    "status": "IN_PROGRESS",
    "statusText": null,
    "startedAt": null,
    "finishedAt": null,
    "canceledAt": null,
    "createdAt": 1761910647113,
    "updatedAt": 1761910647113
  }
}
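Because the job runs asynchronously, a client usually polls until the status changes. The following is a minimal sketch of such a loop, not the service's official client: `waitForJob` and `JobFetcher` are hypothetical names, the `Job` shape is copied from the sample response, and treating any status other than "IN_PROGRESS" as finished is an assumption, since this guide does not list the other status values.

```typescript
// Subset of the job record, taken from the sample response.
interface Job {
  id: string;
  datasetId: string;
  status: string; // only "IN_PROGRESS" appears in this guide
  finishedAt: number | null;
}

// Fetches the current job record; injected so the loop is easy to test.
type JobFetcher = () => Promise<Job>;

const sleep = (ms: number) =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Polls until the job leaves "IN_PROGRESS" or the attempt budget runs out.
// Assumption: any status other than "IN_PROGRESS" means the job has stopped;
// consult the API reference for the real list of status values.
async function waitForJob(
  fetchJob: JobFetcher,
  intervalMs = 2000,
  maxAttempts = 30,
): Promise<Job> {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const job = await fetchJob();
    if (job.status !== "IN_PROGRESS") return job;
    // Wait before the next check, but not after the final attempt.
    if (attempt < maxAttempts) await sleep(intervalMs);
  }
  throw new Error(`Job still in progress after ${maxAttempts} attempts`);
}
```

In an application, `fetchJob` would wrap a GET request to https://api.subworkflow.ai/v1/jobs/<jobId> with the x-api-key header and return the job record. That the record sits under `data`, as it does in the /extract response, is also an assumption. Callers should still inspect the returned status, because a job that stopped has not necessarily succeeded.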
Since API keys are scoped by workspace, all files uploaded with a key are placed in that key's workspace. This goes for the original file, converted copies and any generated pages/embeddings. These files are also displayed in the Subworkflow.ai web portal.
5. Retrieve Your Dataset and its Items
Once the job is complete, you can retrieve the dataset and/or any page to use in your AI application or workflow. In the following example, replace the <datasetId> parameter with the datasetId value obtained from the job response.
Retrieving the Dataset
Requesting the dataset gives you a link to a PDF version of the original document and tells you how many pages it contains (itemCount). Typically, you'll fetch the dataset record only for the metadata needed to query over its Dataset Items.
Curl
curl https://api.subworkflow.ai/v1/datasets/<datasetId> \
  --header 'x-api-key: <YOUR-API-KEY>'
JS/TS
const req = await fetch("https://api.subworkflow.ai/v1/datasets/<datasetId>", {
  method: "GET",
  headers: {
    "Content-Type": "application/json",
    "x-api-key": "<YOUR-API-KEY>"
  }
});
{
  "success": true,
  "total": 1,
  "data": {
    "id": "ds_VV08ECeQBQgDoVn6",
    "workspaceId": "wks_Gg9Bzi7sx8fbCfWI",
    "type": "doc",
    "itemCount": 1,
    "fileName": "file_AIpNsoTx4OkRNY3H",
    "fileExt": "pdf",
    "mimeType": "application/pdf",
    "fileSize": 136056,
    "createdAt": 1761910646651,
    "updatedAt": 1761910646651,
    "expiresAt": 1761910646651,
    "share": {
      "url": "https://api.subworkflow.ai/v1/share/dsx_DdTXOgxPh0PLSPhb?token=VkVBNh",
      "token": "VkVBNh",
      "expiresAt": 1761910891643
    }
  }
}
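The share object in the response carries its own expiresAt, so a client may want to check the link before passing it to a downstream step. The sketch below models only the fields shown in the sample response. The names `ShareLink`, `Dataset` and `isShareLinkUsable` are hypothetical, and reading the timestamps as Unix epoch milliseconds is an assumption inferred from the sample values.

```typescript
// Fields copied from the sample response; the real record may hold more.
interface ShareLink {
  url: string;
  token: string;
  expiresAt: number; // assumed to be Unix epoch milliseconds
}

interface Dataset {
  id: string;
  itemCount: number;
  fileExt: string;
  share: ShareLink;
}

// Returns true while the link has more than `safetyMarginMs` left before
// it expires. The margin leaves time for a download to actually start.
function isShareLinkUsable(
  share: ShareLink,
  nowMs: number = Date.now(),
  safetyMarginMs = 0,
): boolean {
  return share.expiresAt - safetyMarginMs > nowMs;
}
```

When the check fails, requesting the dataset again should return a record with a fresh share object, though this guide does not spell out that behaviour.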
Querying for a range of Dataset Items
In this example, we retrieve the 1st, 3rd and 5th pages of our document in JPEG format. This is particularly powerful when handling large documents (1000+ pages): you don't need to receive the full dataset as you do with other services; just pick out the pages you want! Other cols patterns include the range modifier, where cols=50:100 returns pages 50 to 100. For full details on querying options, please refer to the API reference documentation.
Curl
curl 'https://api.subworkflow.ai/v1/datasets/<datasetId>/items?row=jpg&cols=1,3,5' \
  --header 'x-api-key: <YOUR-API-KEY>'
JS/TS
const req = await fetch("https://api.subworkflow.ai/v1/datasets/<datasetId>/items?row=jpg&cols=1,3,5", {
  method: "GET",
  headers: {
    "Content-Type": "application/json",
    "x-api-key": "<YOUR-API-KEY>"
  }
});
{
  "sort": ["-createdAt"],
  "offset": 0,
  "limit": 10,
  "total": 3,
  "data": [
    {
      "id": "dsx_B5bsOBDzsXsqfmLo",
      "col": 1,
      "row": "jpg",
      "createdAt": 1761910649511,
      "share": {
        "url": "https://api.subworkflow.ai/v1/share/dsx_B5bsOBDzsXsqfmLo?token=kCThsH",
        "token": "kCThsH",
        "expiresAt": 1761911418809
      }
    },
    {
      "id": "dsx_1muCWQXZ58r5PsjC",
      "col": 3,
      "row": "jpg",
      "createdAt": 1761910649511,
      "share": {
        "url": "https://api.subworkflow.ai/v1/share/dsx_1muCWQXZ58r5PsjC?token=Qqkk7U",
        "token": "Qqkk7U",
        "expiresAt": 1761911418809
      }
    },
    {
      "id": "dsx_0yIaKxZjiZIXc1G3",
      "col": 5,
      "row": "jpg",
      "createdAt": 1761910649511,
      "share": {
        "url": "https://api.subworkflow.ai/v1/share/dsx_0yIaKxZjiZIXc1G3?token=7GQKco",
        "token": "7GQKco",
        "expiresAt": 1761911418809
      }
    }
  ]
}
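The two cols patterns described in this section, a comma-separated list and a from:to range, can be produced by a small helper. This is a sketch under stated assumptions: `buildItemsUrl`, `formatCols` and the `Cols` type are hypothetical names, and the validation rules (whole page numbers starting at 1, range start not after range end) are sanity checks chosen for this example rather than documented API constraints.

```typescript
// Either an explicit list of pages or an inclusive range.
type Cols = number[] | { from: number; to: number };

const API_BASE = "https://api.subworkflow.ai/v1";

// Renders the cols parameter: [1, 3, 5] becomes "1,3,5" and
// { from: 50, to: 100 } becomes "50:100".
function formatCols(cols: Cols): string {
  const pages = Array.isArray(cols) ? cols : [cols.from, cols.to];
  if (pages.length === 0 || pages.some((p) => !Number.isInteger(p) || p < 1)) {
    throw new Error("cols must contain whole page numbers starting at 1");
  }
  if (!Array.isArray(cols) && cols.from > cols.to) {
    throw new Error("range start must not exceed range end");
  }
  return Array.isArray(cols) ? cols.join(",") : `${cols.from}:${cols.to}`;
}

// Builds the items URL in the same form as the examples in this section.
function buildItemsUrl(datasetId: string, row: string, cols: Cols): string {
  const id = encodeURIComponent(datasetId);
  const rowParam = encodeURIComponent(row);
  return `${API_BASE}/datasets/${id}/items?row=${rowParam}&cols=${formatCols(cols)}`;
}
```

The resulting string can be passed straight to fetch together with the x-api-key header, and each returned item's share.url then points at the requested page image.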
6. Next steps
Congrats! 🎉 If you've made it this far, you've pretty much mastered the Datasets API! Subworkflow.ai is built on the principle of being simple yet effective, but please let us know if there's anything we could improve.
Head on over to the next section on ways to use the Datasets API in your application and workflows.