This project is designed to monitor Prebid integrations. It is now built using TypeScript and uses Vitest for testing.
Prerequisites:

- Node.js (v16 or later recommended)
- npm (comes with Node.js)
Clone the repository:

git clone https://github.com/prebid/prebid-integration-monitor.git
cd prebid-integration-monitor

Install dependencies:

npm install
- `bin/`: Contains executable scripts for running the CLI.
  - `dev.js`: Entry point for development mode (uses ts-node).
  - `run.js`: Entry point for production mode (uses compiled JS).
- `src/`: Contains the TypeScript source code for the application.
  - `commands/`: Contains the oclif command classes.
    - `index.ts`: The default command that runs when no specific command is provided. It now houses the main application logic.
  - `index.ts`: Previously the main entry point; its role is now superseded by the oclif structure in `bin/` and `src/commands/`.
  - `cluster.cts`: Handles Puppeteer cluster setup (uses CommonJS syntax).
  - `utils/`: Utility scripts.
  - Other `.ts` files for application logic.
- `dist/`: Contains the compiled JavaScript code, generated by `npm run build`.
  - `dist/commands/`: Contains compiled oclif commands.
- `tests/`: Contains test files written using Vitest.
  - `example.test.ts`: An example test file.
- `package.json`: Lists project dependencies, npm scripts, and oclif configuration.
- `tsconfig.json`: Configuration file for the TypeScript compiler.
- `node_modules/`: Directory where npm installs project dependencies.
To run the application in development mode (using the oclif development script, which leverages `ts-node`):

npm run dev

This executes `node ./bin/dev.js`, which handles TypeScript execution.
To compile the TypeScript code to JavaScript (output will be in the `dist/` directory):

npm run build

Note on `tsc` execution: In some environments, if `tsc` (the TypeScript compiler) is not found via the npm script, you may need to invoke it using `npx`:

npx -p typescript tsc

This is typically unnecessary when `npm run build` works, since the npm script uses the `tsc` from your project's `devDependencies`. Type checking alone can be run with `npm run build:check` or `npx -p typescript tsc --noEmit --module nodenext --moduleResolution nodenext src/**/*.ts src/cluster.cts`.
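For reference, a minimal `tsconfig.json` consistent with the `nodenext` flags shown above might look like the following. This is a sketch, not necessarily the repository's exact configuration:

```json
{
  "compilerOptions": {
    "module": "nodenext",
    "moduleResolution": "nodenext",
    "outDir": "dist",
    "strict": true
  },
  "include": ["src/**/*"]
}
```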
After building the project (`npm run build`), run the compiled application using its oclif entry point:

npm start

This executes `node ./bin/run.js`.
This application is now structured as an oclif (Open CLI Framework) command-line interface.

- Running commands:
  - In development: `node ./bin/dev.js [COMMAND] [FLAGS]`
  - In production (after `npm run build`): `node ./bin/run.js [COMMAND] [FLAGS]`
  - If the package is linked globally (`npm link`) or installed globally, you can use the bin name directly: `app [COMMAND] [FLAGS]` (Note: `app` is the default bin name configured in `package.json`.)
- Default Command:
  - Running `node ./bin/run.js` (or `node ./bin/dev.js`) without any specific command will execute the default command defined in `src/commands/index.ts`. This command currently runs the main Prebid integration monitoring logic.
- Getting Help:
  - To see general help for the CLI and a list of available commands: `node ./bin/run.js --help` (or, in development, `node ./bin/dev.js --help`)
  - For help on a specific command: `node ./bin/run.js [COMMAND] --help`
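The oclif wiring lives in `package.json`. A sketch of the relevant section follows; the field values here are assumptions based on the behavior described above (bin name `app`, compiled commands in `dist/commands`), so check the actual file:

```json
{
  "bin": {
    "app": "./bin/run.js"
  },
  "oclif": {
    "bin": "app",
    "commands": "./dist/commands"
  }
}
```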
The `scan` command analyzes a list of websites for Prebid.js integrations and other specified ad technology libraries. It processes URLs from an input source, launches Puppeteer instances to visit each page, and collects data, saving the results to an output directory.

Syntax:

./bin/run scan [INPUTFILE] [FLAGS...]
# or in development: node ./bin/dev.js scan [INPUTFILE] [FLAGS...]
# or using npm script: npm run prebid:scan -- [INPUTFILE] [FLAGS...] (note the -- to pass arguments to the script)
Argument:

- `INPUTFILE`: (Optional) Path to an input file containing a list of URLs to scan (one URL per line).
  - This argument is required if `--githubRepo` is not used.
  - If `--githubRepo` is provided, `INPUTFILE` is ignored.
  - If neither `INPUTFILE` nor `--githubRepo` is specified, the command shows an error.
  - Previously this defaulted to `src/input.txt`; now an input source must be explicitly provided. (Note: the CLI was updated in `scan.ts` to require either `INPUTFILE` or `--githubRepo` explicitly.)
Flags:

- `--githubRepo <URL>`: Specifies a public GitHub URL from which to fetch URLs.
  - This can be a base repository URL (e.g., `https://github.com/owner/repo`) to scan for URLs within processable files (such as `.txt` and `.md`) in the repository root.
  - Alternatively, it can be a direct link to a specific processable file within a repository (e.g., `https://github.com/owner/repo/blob/main/some/path/file.txt`). In this case, only the specified file is fetched and processed.
  - Example (repository): `--githubRepo https://github.com/owner/repo`
  - Example (direct file): `--githubRepo https://github.com/owner/repo/blob/main/urls.txt`
- `--csvFile <path_or_url>`: Path to a local CSV file or a URL (e.g., a raw GitHub CSV link) from which to load URLs for scanning. The scanner expects URLs in the first column of the CSV. This flag takes precedence over `--githubRepo` and the `INPUTFILE` argument.
  - Example (local): `--csvFile ./path/to/your/urls.csv`
  - Example (remote): `--csvFile https://raw.githubusercontent.com/user/repo/main/path/to/your/urls.csv`
- `--numUrls <number>`: When used with `--githubRepo`, limits the number of URLs extracted and processed from the repository. (Note: this flag does not currently apply to URLs loaded via `--csvFile`.)
  - Default: `100`
  - Example: `--numUrls 50`
- `--puppeteerType <option>`: Specifies the Puppeteer operational mode.
  - Options: `vanilla`, `cluster` (default)
  - `vanilla`: Processes URLs sequentially using a single Puppeteer browser instance.
  - `cluster`: Uses `puppeteer-cluster` to process URLs in parallel, according to the concurrency settings.
- `--concurrency <value>`: Sets the number of concurrent Puppeteer instances when using `puppeteerType=cluster`.
  - Default: `5`
- `--headless`: Runs Puppeteer in headless mode (no UI). This is the default.
- `--no-headless`: Runs Puppeteer with a visible browser UI.
- `--monitor`: Enables `puppeteer-cluster`'s web monitoring interface (available at `http://localhost:21337` by default) when `puppeteerType=cluster`.
  - Default: `false`
- `--outputDir <value>`: Specifies the directory where scan results (JSON files) are saved.
  - Default: `output` (in the project root)
  - Results are typically saved in a subdirectory structure like `output/Month/YYYY-MM-DD.json`.
- `--logDir <value>`: Specifies the directory where log files (`app.log`, `error.log`) are saved.
  - Default: `logs` (in the project root)
- `--range <string>`: Specifies a line range (e.g., `10-20`, `5-`, `-15`) to process from the input source (file, CSV, or GitHub-extracted URLs), using 1-based indexing. If the source is a GitHub repo, the range applies to the aggregated list of URLs extracted from all targeted files in the repo.
  - Example: `--range 10-50` or `--range 1-`
- `--chunkSize <number>`: Processes URLs in chunks of this size. All URLs (whether from the full input or a specified range) are still processed, but only `chunkSize` URLs are loaded and analyzed at a time. Useful for very large URL lists to manage resources or process incrementally.
  - Example: `--chunkSize 50`
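The `--range` and `--chunkSize` behaviors described above can be sketched as follows. This is a simplified illustration of the semantics, not the tool's actual implementation; `parseRange` and `chunk` are hypothetical helper names:

```typescript
// Parse a 1-based range spec like '10-20', '5-', or '-15' against a list length.
// Returns [start, end] as 1-based inclusive bounds.
function parseRange(spec: string, total: number): [number, number] {
  const [startPart, endPart] = spec.split('-');
  const start = startPart === '' ? 1 : parseInt(startPart, 10);
  const end = endPart === '' ? total : parseInt(endPart, 10);
  if (isNaN(start) || isNaN(end) || start < 1 || start > end) {
    throw new Error(`Invalid range: ${spec}`);
  }
  return [start, Math.min(end, total)];
}

// Split a list into chunks of at most `size` items, to be processed one at a time.
function chunk<T>(items: T[], size: number): T[][] {
  const out: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    out.push(items.slice(i, i + size));
  }
  return out;
}

// Example: take lines 3-8 of a 10-URL list, then process in chunks of 2.
const urls = Array.from({ length: 10 }, (_, i) => `https://site${i + 1}.example`);
const [start, end] = parseRange('3-8', urls.length);
const selected = urls.slice(start - 1, end); // convert 1-based bounds to a 0-based slice
const batches = chunk(selected, 2);
```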
Usage Examples:

- Basic scan (using default `input.txt` and cluster mode):

  ./bin/run scan

  (Ensure `./bin/run` has execute permissions or use `node ./bin/run scan`.)

- Scan using vanilla Puppeteer:

  ./bin/run scan --puppeteerType=vanilla

- Scan with a specific input file and output directory:

  ./bin/run scan my_urls.txt --outputDir=./my_results

- Scan in non-headless (headed) mode:

  ./bin/run scan --no-headless

- Scan with increased concurrency and monitoring for cluster mode:

  ./bin/run scan --concurrency=10 --monitor

- Scan URLs from a GitHub repository:

  ./bin/run scan --githubRepo https://github.com/owner/repo

- Scan a limited number of URLs from a GitHub repository:

  ./bin/run scan --githubRepo https://github.com/owner/repo --numUrls 50

- Scan URLs from a local CSV file:

  ./bin/run scan --csvFile ./data/urls_to_scan.csv

- Scan URLs from a remote CSV file (raw GitHub link):

  ./bin/run scan --csvFile https://raw.githubusercontent.com/prebid/prebid-integration-monitor/main/tests/fixtures/sample_urls.csv

- Scan a specific range of URLs from a large input file, in chunks:

  ./bin/run scan very_large_list_of_sites.txt --range 1001-2000 --chunkSize 100
- The scanner only processes URLs that begin with `http://` or `https://`.
- Entries in input sources (files or GitHub content) that are malformed (e.g., `htp://missing-t.com`), schemeless (e.g., `example.com` without a leading `http://` or `https://`), or plain text are skipped and not processed as URLs.
- The tool is designed to robustly process valid URLs even when they are mixed with such non-URL or malformed entries in the source content.
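The filtering rules above can be illustrated with a small sketch. This approximates the described behavior and is not the tool's actual code:

```typescript
// Keep only entries that start with a well-formed http:// or https:// scheme.
function extractUrls(lines: string[]): string[] {
  return lines
    .map((line) => line.trim())
    .filter((line) => /^https?:\/\//.test(line));
}

const source = [
  'https://example.com',     // kept
  'http://another-site.org', // kept
  'htp://missing-t.com',     // skipped: malformed scheme
  'example.com',             // skipped: schemeless
  'just some plain text',    // skipped: not a URL
];
const validUrls = extractUrls(source);
```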
The `stats:generate` command processes stored website scan data (typically found in the `./store` directory, generated by the `scan` command) to generate or update the `api/api.json` file. This JSON file contains aggregated statistics about Prebid.js usage, including version distributions and module usage, after cleaning and categorization.

Syntax:

./bin/run stats:generate
# or in development: node ./bin/dev.js stats:generate
# or using npm script (if configured): npm run prebid:stats:generate
Flags:
This command currently does not take any specific flags.
Usage Example:

- Generate or update the statistics API file:

  ./bin/run stats:generate

  (Ensure `./bin/run` has execute permissions or use `node ./bin/run stats:generate`.)
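The kind of aggregation `stats:generate` performs can be sketched as follows. The `ScanRecord` shape and field names here are assumptions for illustration; the real files in `./store` may differ:

```typescript
// Hypothetical shape of a stored scan record.
interface ScanRecord {
  url: string;
  prebid?: { version: string; modules: string[] };
}

// Roll up version and module counts, the kind of summary api/api.json holds.
function aggregate(records: ScanRecord[]) {
  const versionDistribution: Record<string, number> = {};
  const moduleDistribution: Record<string, number> = {};
  for (const r of records) {
    if (!r.prebid) continue; // skip sites without a Prebid.js integration
    versionDistribution[r.prebid.version] =
      (versionDistribution[r.prebid.version] ?? 0) + 1;
    for (const m of r.prebid.modules) {
      moduleDistribution[m] = (moduleDistribution[m] ?? 0) + 1;
    }
  }
  return { versionDistribution, moduleDistribution };
}

const stats = aggregate([
  { url: 'https://a.example', prebid: { version: '8.52.0', modules: ['rubiconBidAdapter'] } },
  { url: 'https://b.example', prebid: { version: '8.52.0', modules: ['appnexusBidAdapter'] } },
  { url: 'https://c.example' }, // no Prebid.js found; excluded from the rollup
]);
```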
To run the test suite using Vitest:

npm test

Note on `vitest` execution: In some environments, if `vitest` is not found via the npm script, you may need to run it using `npx`:

npx -p vitest vitest run

This executes all tests found in the `tests/` directory.
This application utilizes the Winston logging library to provide detailed and structured logging. Log files are stored in the `logs/` directory, which is excluded from Git commits via `.gitignore`.

- `logs/app.log`: Contains all general application logs, including informational messages, warnings, and errors (typically `info` level and above). Entries are stored in JSON format, allowing for easy parsing and querying; each includes a timestamp, log level, message, and any additional metadata.
- `logs/error.log`: Dedicated to error-level logs only. It provides a focused view of errors that have occurred within the application, also in JSON format, and includes stack traces when available.
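Because both files store one JSON object per line, they are easy to post-process. A small sketch (the exact field names in the real log files are assumptions based on the format described in this section):

```typescript
// Each log line is a JSON object with at least timestamp, level, and message.
interface LogEntry {
  timestamp: string;
  level: string;
  message: string;
  [meta: string]: unknown;
}

// Parse newline-delimited JSON log content into entries.
function parseLogLines(raw: string): LogEntry[] {
  return raw
    .split('\n')
    .filter((line) => line.trim() !== '')
    .map((line) => JSON.parse(line) as LogEntry);
}

const sample = [
  '{"timestamp":"2024-05-01 10:00:00","level":"info","message":"scan started"}',
  '{"timestamp":"2024-05-01 10:00:05","level":"error","message":"page load failed","stack":"Error: ..."}',
].join('\n');

const entries = parseLogLines(sample);
const errors = entries.filter((e) => e.level === 'error');
```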
In addition to file logging, messages are also output to the console:

- Log messages are colorized by severity level for better readability (e.g., errors in red, warnings in yellow).
- The console typically displays `info` level messages and above.
- The format includes the timestamp, log level, and message, similar to the file logs.

All log entries, whether in files or on the console, follow a consistent format:

- Timestamp: `YYYY-MM-DD HH:mm:ss`
- Level: The severity of the log (e.g., `info`, `warn`, `error`).
- Message: The main content of the log entry.
- Stack Trace: For error logs, a stack trace is included if available, aiding in debugging.
OpenTelemetry has been integrated to provide distributed tracing capabilities.

- The main tracer configuration can be found in `src/tracer.ts`.
- Log messages (both console and file) are now automatically enriched with `trace_id` and `span_id` when generated within an active trace.
- The default OTLP HTTP exporter is used. You may need to set the `OTEL_EXPORTER_OTLP_ENDPOINT` environment variable to point to your OpenTelemetry collector (e.g., `http://localhost:4318/v1/traces`).
- The service name for OpenTelemetry is configured via the `OTEL_SERVICE_NAME` environment variable (e.g., `export OTEL_SERVICE_NAME="prebid-integration-monitor"`). An attempt to set this directly in `src/tracer.ts` using the `Resource` attribute resulted in a TypeScript build error (TS2693) due to OpenTelemetry package version incompatibilities, so it remains commented out.