bulkyard: A Salesforce CLI Plugin for Bulk Data Operations via SQLite
Because copy-pasting SOQL into Data Loader for the fifth time this sprint shouldn’t be a viable workflow.
If you’ve followed along on this blog, you know I have a fondness for working with Salesforce data locally. Back in 2024, I wrote about using SQLite3 as a staging area for Salesforce data work, and earlier this year I covered Jupyter notebooks as a discovery tool for new org engagements. The through-line in both posts was the same idea: pull data out of Salesforce, work on it locally, and push the results back.
The Python approach works well. simple_salesforce is a solid library, and pandas makes data manipulation pleasant. But there’s always been some boilerplate to write: authenticate, describe the object to get field names, build the SOQL query, call the Bulk API, juggle the CSVs, load them into SQLite, remember what you did and in what order. Do it for one object and it’s fine. Do it for a dozen objects as part of a repeatable data migration? It starts to feel like homework.
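To make that boilerplate concrete, here's roughly what the extract half looks like when you hand-roll it. This is a hedged sketch, not code from the earlier posts: `client` is assumed to be any authenticated client exposing a simple_salesforce-style `query_all` method, and the table handling is deliberately minimal.

```python
import sqlite3

def extract_to_sqlite(client, db_path, table, soql):
    """Pull records via a query_all-style client and stage them in SQLite.

    `client` is assumed to expose query_all(soql) -> {"records": [...]},
    as simple_salesforce does; each record dict carries an "attributes"
    key that we strip before loading.
    """
    records = client.query_all(soql)["records"]
    rows = [{k: v for k, v in r.items() if k != "attributes"} for r in records]
    if not rows:
        return 0
    columns = list(rows[0].keys())
    cols_sql = ", ".join(f'"{c}"' for c in columns)
    placeholders = ", ".join("?" for _ in columns)
    with sqlite3.connect(db_path) as conn:
        conn.execute(f'DROP TABLE IF EXISTS "{table}"')
        conn.execute(f'CREATE TABLE "{table}" ({cols_sql})')
        conn.executemany(
            f'INSERT INTO "{table}" VALUES ({placeholders})',
            [tuple(r[c] for c in columns) for r in rows],
        )
    return len(rows)
```

Multiply that by authentication, describes, type mapping, and a dozen objects, and the "homework" feeling sets in quickly.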
So I built bulkyard, a Salesforce CLI plugin that handles the extract/load cycle between Salesforce and a local SQLite3 database. It’s open-source, installable in seconds, and driven by a YAML config file that makes multi-object operations repeatable by design.
Installing It
If you have the Salesforce CLI installed, adding bulkyard is one line:
sf plugins install bulkyard
That’s it. No virtual environments, no pip install, no requirements.txt to manage.
The Config File
The heart of bulkyard is a YAML configuration file that describes what you want to extract or load, and from/to where. Here’s a simple example:
database: bulkyard.db
objects:
  - sobject: Account
    query: "SELECT Id, Name, Industry, Website FROM Account WHERE CreatedDate = LAST_N_DAYS:90"
  - sobject: Contact
    query: "SELECT Id, AccountId, FirstName, LastName, Email FROM Contact WHERE Account.CreatedDate = LAST_N_DAYS:90"
This tells bulkyard to extract two objects into a local SQLite file called bulkyard.db. The tables are named after the SObject by default, though you can override that too.
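For example, you might stage raw extracts into suffixed tables to keep them separate from cleaned ones. The exact `table` key shown here mirrors the override that appears later in the load config; treat this fragment as a sketch of the idea rather than a complete schema reference:

```yaml
database: bulkyard.db
objects:
  - sobject: Account
    # Table override: stage into Account_Raw instead of the default Account.
    table: Account_Raw
    query: "SELECT Id, Name, Industry FROM Account"
```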
The value of a config file over a one-off script is that, six weeks from now, you (or a colleague) can open the YAML and immediately understand exactly what was extracted and why. If you can't run it again cleanly, it's not a workflow.
Extracting Data
With the config in place, extracting is straightforward:
sf bulkyard extract --config-file config.yaml --target-org my-sandbox
bulkyard uses the Bulk API 2.0 under the hood. It describes each object to build proper type mappings, creates typed SQLite tables, and inserts records in transactional batches of 1,000 rows. You end up with a local .db file that you can query with any SQLite tool (DBeaver, the sqlite3 CLI, or from a Jupyter notebook if that’s your jam).
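Once the extract finishes, a quick way to see what landed, short of opening a GUI, is a few lines of Python. A minimal sketch; it assumes the bulkyard.db file from the command above sits in the working directory:

```python
import sqlite3

def list_tables(db_path):
    """Return the names of all tables in a SQLite database file."""
    with sqlite3.connect(db_path) as conn:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    return [name for (name,) in rows]

tables = list_tables("bulkyard.db")  # with the config above: Account and Contact
print(tables)
```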
If you want to skip the config file for a quick one-off extraction, you can pass the object and query directly:
sf bulkyard extract \
--sobject Opportunity \
--query "SELECT Id, Name, StageName, Amount FROM Opportunity WHERE IsClosed = FALSE" \
--database opps.db \
--target-org my-sandbox
That kind of flexibility matters during discovery work. Sometimes you want a full config-driven pipeline. Sometimes you just want to grab a table and start poking at it.
Transforming Locally
Once the data is in SQLite, you’re in familiar territory. Everything from the 2024 data work post applies here: SQL JOINs for relationship work, Pandas for anything more complex, or even just running queries in a GUI to sanity-check what you’ve got.
The key difference is that bulkyard eliminates the extraction boilerplate. By the time you're writing your first transformation query, the data is already there.
For example, after extracting Account and Contact, you might produce a cleaned Contact table for a migration scenario:
CREATE TABLE Contact_Clean AS
SELECT
  c.Id,
  c.AccountId,
  c.FirstName,
  c.LastName,
  c.Email
FROM Contact c
INNER JOIN Account a ON c.AccountId = a.Id
WHERE c.Email IS NOT NULL
  AND a.Industry IS NOT NULL;
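If that cleanup becomes part of a repeatable pipeline, the same SQL is easy to drive from Python's standard library. A sketch under the assumption that the Account and Contact tables were extracted as above; the function name is mine, not part of bulkyard:

```python
import sqlite3

CLEAN_SQL = """
CREATE TABLE Contact_Clean AS
SELECT c.Id, c.AccountId, c.FirstName, c.LastName, c.Email
FROM Contact c
INNER JOIN Account a ON c.AccountId = a.Id
WHERE c.Email IS NOT NULL
  AND a.Industry IS NOT NULL
"""

def build_contact_clean(conn):
    """(Re)build Contact_Clean and return the number of rows it holds."""
    conn.execute("DROP TABLE IF EXISTS Contact_Clean")
    conn.execute(CLEAN_SQL)
    return conn.execute("SELECT COUNT(*) FROM Contact_Clean").fetchone()[0]
```

Call it with `sqlite3.connect("bulkyard.db")` and you get back a row count worth logging before moving on to the load step.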
Loading Data Back
When you’re ready to push data back into Salesforce, bulkyard load handles it:
sf bulkyard load \
--sobject Contact \
--operation upsert \
--external-id-field Id \
--database opps.db \
--target-org my-production-org
The --operation flag supports insert, update, upsert, and delete, so the full CRUD surface is covered. For upsert operations, you specify the external ID field. Typically, that’s Id for updates within an org, or a custom external ID field for cross-org migrations.
For multi-object loads, this is where the config file can pay off:
database: bulkyard.db
objects:
  - sobject: Account
    operation: upsert
    external_id_field: External_Id__c
  - sobject: Contact
    operation: upsert
    external_id_field: External_Id__c
    table: Contact_Clean
The table override on Contact is how you tell bulkyard to load from Contact_Clean instead of the raw Contact table. Run it with:
sf bulkyard load --config-file config.yaml --target-org my-production-org
And that’s your full pipeline: extract, transform locally, load back. All config-driven, all repeatable.
A Lesson About jsforce
I want to share something that bit me during development, because it’ll probably bite you too if you reach for the same tooling.
When I started working on bulkyard, the obvious path for Bulk API 2.0 operations in a TypeScript SF CLI plugin was conn.bulk2.query() on @salesforce/core's Connection class (which wraps jsforce under the hood). The API looks clean:
// The first approach — looks fine, isn't
const queryResult = await conn.bulk2.query(soqlQuery);
const records = await queryResult.toArray();
This works. Right up until you point it at an object with a meaningful number of records, at which point it times out. The fix was to drop the abstraction entirely and implement the job lifecycle directly against the REST endpoints: POST a query job, poll until it completes, then walk the paginated CSV results page by page.
// The actual implementation
const jobInfo = await conn.requestPost<QueryJobInfo>(jobsUrl, {
  operation: 'query',
  query: soqlQuery,
});
await waitForJobCompletion(conn, jobInfo.id);

let locator: string | null = null;
do {
  const url = locator ? `${resultsUrl}?locator=${encodeURIComponent(locator)}` : resultsUrl;
  const response = await fetch(url, { headers: authHeaders });
  locator = response.headers.get('sforce-locator');
  // parse CSV page, batch insert...
} while (locator && locator !== 'null');
This is not unusual territory if you've worked with the Bulk API 2.0 REST spec directly, but it's a meaningful gap in the jsforce abstraction that I didn't know about. If you're building anything on top of Bulk API 2.0 in TypeScript and care about completeness for large datasets, go straight to the REST endpoints.
Why This Approach?
Nothing bulkyard does is impossible with the Python tooling I've written about before. What it adds is a tighter integration with the Salesforce CLI, which most developers already have installed, and a configuration model that makes the extract/load cycle feel effortless.
If you’re already using the SF CLI day-to-day, bulkyard should slot into your workflow without friction. And if you’ve been building Python scripts to wrangle data between Salesforce and SQLite, consider whether you want to keep maintaining that boilerplate or just offload the plumbing.
Check It Out
bulkyard is open-source. You can find the source on GitHub and install it from npm. Issues and PRs are welcome.
sf plugins install bulkyard
Happy data wrangling!