createWorker
is a function that creates a Tesseract.js worker. A Tesseract.js worker is an object that creates and manages an instance of Tesseract running in a web worker (browser) or worker thread (Node.js). Once created, OCR jobs are sent through the worker.
Arguments:
langs
a string to indicate the languages traineddata to download, multiple languages are specified using an array (['eng', 'chi_sim'])oem
a enum to indicate the OCR Engine Mode you useoptions
an object of customized options
corePath
path to a directory containing both tesseract-core.wasm.js
and tesseract-core-simd.wasm.js
from Tesseract.js-core package
corePath
to a specific .js
file is strongly discouraged. To provide the best performance on all devices, Tesseract.js needs to be able to pick between tesseract-core.wasm.js
and tesseract-core-simd.wasm.js
. See this issue for more detail.langPath
path for downloading traineddata, do not include /
at the end of the pathworkerPath
path for downloading worker scriptdataPath
path for saving traineddata in WebAssembly file system, not common to modifycachePath
path for the cached traineddata, more useful for Node, for browser it only changes the key in IndexDBcacheMethod
a string to indicate the method of cache management, should be one of the following optionslegacyCore
set to true
to ensure any code downloaded supports the Legacy model (in addition to LSTM model)legacyLang
set to true
to ensure any language data downloaded supports the Legacy model (in addition to LSTM model)workerBlobURL
a boolean to define whether to use Blob URL for worker script, default: truegzip
a boolean to define whether the traineddata from the remote is gzipped, default: truelogger
a function to log the progress, a quick example is m => console.log(m)
errorHandler
a function to handle worker errors, a quick example is err => console.error(err)
config
an object of customized options which are set prior to initialization
worker.setParameters
or the options
argument of worker.recognize
.load_system_dawg
, load_number_dawg
, and load_punc_dawg
Examples:
const { createWorker } = Tesseract;
const worker = await createWorker('eng', 1, {
langPath: '...',
logger: m => console.log(m),
});
worker.recognize
provides core function of Tesseract.js as it executes OCR
Figures out what words are in image
, where the words are in image
, etc.
[!TIP] Note:
image
should be sufficiently high resolution. Often, the same image will get much better results if you upscale it before callingrecognize
.
Arguments:
image
see Image Format for more details.options
an object of customized options
rectangle
an object to specify the regions you want to recognized in the image, should contain top, left, width and height, see example below.output
an object specifying which output formats to return (by default text
, blocks
, hocr
, and tsv
are returned)jobId
Please see details aboveOutput:
worker.recognize
returns a promise to an object containing jobId
and data
properties. The data
property contains output in all of the formats specified using the output
argument.
[!NOTE]
worker.recognize
still returns an output object even if no text is detected (the outputs will simply contain no words). No exception is thrown as determining the page is empty is considered a valid result.
Examples:
const { createWorker } = Tesseract;
(async () => {
const worker = await createWorker('eng');
const { data: { text } } = await worker.recognize(image);
console.log(text);
})();
With rectangle
const { createWorker } = Tesseract;
(async () => {
const worker = await createWorker('eng');
const { data: { text } } = await worker.recognize(image, {
rectangle: { top: 0, left: 0, width: 100, height: 100 },
});
console.log(text);
})();
worker.setParameters()
set parameters for Tesseract API (using SetVariable()), it changes the behavior of Tesseract and some parameters like tessedit_char_whitelist is very useful.
Arguments:
params
an object with key and value of the parametersjobId
Please see details aboveNote: worker.setParameters
cannot be used to change the oem
, as this value is set at initialization. oem
is initially set using an argument of createWorker
. After a worker already exists, changing oem
requires running worker.reinitialize
.
Useful Parameters:
name | type | default value | description |
---|---|---|---|
tessedit_pageseg_mode | enum | PSM.SINGLE_BLOCK | Check HERE for definition of each mode |
tessedit_char_whitelist | string | '' | setting white list characters makes the result only contains these characters, useful if content in image is limited |
preserve_interword_spaces | string | '0' | '0' or '1', keeps the space between words |
user_defined_dpi | string | '' | Define custom dpi, use to fix Warning: Invalid resolution 0 dpi. Using 70 instead. |
This list is incomplete. As Tesseract.js passes parameters to the Tesseract engine, all parameters supported by the underlying version of Tesseract should also be supported by Tesseract.js. (Note that parameters marked as “init only” in Tesseract documentation cannot be set by setParameters
or recognize
.)
Examples:
(async () => {
await worker.setParameters({
tessedit_char_whitelist: '0123456789',
});
})
worker.reinitialize()
re-initializes an existing Tesseract.js worker with different langs
and oem
arguments.
Arguments:
langs
a string to indicate the languages traineddata to download, multiple languages are specified using an array (['eng', 'chi_sim'])oem
a enum to indicate the OCR Engine Mode you useconfig
an object of customized options which are set prior to initialization (see details above)jobId
Please see details aboveNote: to switch from Tesseract LSTM (oem
value 1
) to Tesseract Legacy (oem
value 0
) using worker.reinitialize()
, the worker must already contain the code required to run the Tesseract Legacy model. Setting legacyCore: true
and legacyLang: true
in createWorker
options ensures this is the case.
Examples:
await worker.reinitialize('eng', 1);
worker.detect
does OSD (Orientation and Script Detection) to the image instead of OCR.
[!NOTE]
Runningworker.detect
requires a worker with code and language data that supports Tesseract Legacy (this is not enabled by default). If you want to runworker.detect
, setlegacyCore
andlegacyLang
totrue
in thecreateWorker
options.
Arguments:
image
see Image Format for more details.jobId
Please see details aboveExamples:
const { createWorker } = Tesseract;
(async () => {
const worker = await createWorker('eng', 1, {legacyCore: true, legacyLang: true});
const { data } = await worker.detect(image);
console.log(data);
})();
worker.terminate
terminates the worker and cleans up
(async () => {
await worker.terminate();
})();
worker.writeText
writes a text file to the path specified in MEMFS, it is useful when you want to use some features that requires tesseract.js
to read file from file system.
Arguments:
path
text file pathtext
content of the text filejobId
Please see details aboveExamples:
(async () => {
await worker.writeText('tmp.txt', 'Hi\nTesseract.js\n');
})();
worker.readText
reads a text file to the path specified in MEMFS, it is useful when you want to check the content.
Arguments:
path
text file pathjobId
Please see details aboveExamples:
(async () => {
const { data } = await worker.readText('tmp.txt');
console.log(data);
})();
worker.removeFile
removes a file in MEMFS, it is useful when you want to free the memory.
Arguments:
path
file pathjobId
Please see details aboveExamples:
(async () => {
await worker.removeFile('tmp.txt');
})();
worker.FS
is a generic FS function to do anything you want, you can check HERE for all functions.
Arguments:
method
method nameargs
array of arguments to passjobId
Please see details aboveExamples:
(async () => {
await worker.FS('writeFile', ['tmp.txt', 'Hi\nTesseract.js\n']);
// equal to:
// await worker.writeText('tmp.txt', 'Hi\nTesseract.js\n');
})();
createScheduler
is a factory function to create a scheduler, a scheduler manages a job queue and workers to enable multiple workers to work together, it is useful when you want to speed up your performance.
Examples:
const { createScheduler } = Tesseract;
const scheduler = createScheduler();
scheduler.addWorker
adds a worker into the worker pool inside scheduler, it is suggested to add one worker to only one scheduler.
Arguments:
worker
see Worker aboveExamples:
const { createWorker, createScheduler } = Tesseract;
const scheduler = createScheduler();
const worker = await createWorker();
scheduler.addWorker(worker);
scheduler.addJob
adds a job to the job queue and scheduler waits and finds an idle worker to take the job.
Arguments:
action
a string to indicate the action you want to do, right now only recognize and detect are supportedpayload
a arbitrary number of args depending on the action you called.Examples:
(async () => {
const { data: { text } } = await scheduler.addJob('recognize', image, options);
const { data } = await scheduler.addJob('detect', image);
})();
scheduler.getQueueLen()
returns the length of job queue.
Scheduler.getNumWorkers()
returns number of workers added into the scheduler
scheduler.terminate()
terminates all workers added, useful to do quick clean up.
Examples:
(async () => {
await scheduler.terminate();
})();
setLogging
sets the logging flag, you can setLogging(true)
to see detailed information, useful for debugging.
Arguments:
logging
boolean to define whether to see detailed logs, default: falseExamples:
const { setLogging } = Tesseract;
setLogging(true);
[!WARNING]
This function is depreciated and should be replaced withworker.recognize
(see above).
recognize
works the same as worker.recognize
, except that a new worker is created, loaded, and destroyed every time the function is called.
See Tesseract.js
[!WARNING]
This function is depreciated and should be replaced withworker.detect
(see above).
detect
works the same as worker.detect
, except that a new worker is created, loaded, and destroyed every time the function is called.
See Tesseract.js
See PSM.js
See OEM.js