This project is designed to monitor Prebid integrations. It is now built using TypeScript and uses Vitest for testing.
Prerequisites:

- Node.js (v16 or later recommended)
- npm (comes with Node.js)
Clone the repository:

git clone https://github.com/prebid/prebid-integration-monitor.git
cd prebid-integration-monitor

Install dependencies:

npm install
- `bin/`: Contains executable scripts for running the CLI.
  - `dev.js`: Entry point for development mode (uses ts-node).
  - `run.js`: Entry point for production mode (uses compiled JS).
- `src/`: Contains the TypeScript source code for the application.
  - `commands/`: Contains the oclif command classes.
    - `index.ts`: The default command that runs when no specific command is provided. It now houses the main application logic.
  - `index.ts`: Previously the main entry point; its role is now superseded by the oclif structure in `bin/` and `src/commands/`.
  - `cluster.cts`: Handles Puppeteer cluster setup (uses CommonJS syntax).
  - `utils/`: Utility scripts.
  - Other `.ts` files for application logic.
- `dist/`: Contains the compiled JavaScript code, generated by `npm run build`.
  - `dist/commands/`: Contains compiled oclif commands.
- `tests/`: Contains test files written using Vitest.
  - `example.test.ts`: An example test file.
- `package.json`: Lists project dependencies, npm scripts, and oclif configuration.
- `tsconfig.json`: Configuration file for the TypeScript compiler.
- `node_modules/`: Directory where npm installs project dependencies.
To run the application in development mode (using the oclif development script, which leverages `ts-node`):

npm run dev

This executes `node ./bin/dev.js`, which handles TypeScript execution.
To compile the TypeScript code to JavaScript (output will be in the `dist/` directory):

npm run build

Note on `tsc` execution: In some environments, if `tsc` (the TypeScript compiler) is not found via the npm script, you may need to invoke it using `npx`:

npx -p typescript tsc

This is typically unnecessary when `npm run build` works, since the npm script uses the `tsc` from your project's `devDependencies`. Type checking alone can be run with `npm run build:check` or `npx -p typescript tsc --noEmit --module nodenext --moduleResolution nodenext src/**/*.ts src/cluster.cts`.
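For reference, a minimal `tsconfig.json` consistent with the `nodenext` flags shown above might look like the following. This is a sketch, not necessarily the repository's exact configuration:

```json
{
  "compilerOptions": {
    "module": "nodenext",
    "moduleResolution": "nodenext",
    "outDir": "dist",
    "strict": true
  },
  "include": ["src/**/*"]
}
```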
After building the project (`npm run build`), run the compiled application using its oclif entry point:

npm start

This executes `node ./bin/run.js`.
This application is now structured as an oclif (Open CLI Framework) command-line interface.

- Running commands:
  - In development: `node ./bin/dev.js [COMMAND] [FLAGS]`
  - In production (after `npm run build`): `node ./bin/run.js [COMMAND] [FLAGS]`
  - If the package is linked globally (`npm link`) or installed globally, you can use the bin name directly: `app [COMMAND] [FLAGS]` (Note: `app` is the default bin name configured in `package.json`.)
- Default Command:
  - Running `node ./bin/run.js` (or `node ./bin/dev.js`) without any specific command will execute the default command defined in `src/commands/index.ts`. This command currently runs the main Prebid integration monitoring logic.
- Getting Help:
  - To see general help for the CLI and a list of available commands: `node ./bin/run.js --help` (or, in development, `node ./bin/dev.js --help`)
  - For help on a specific command: `node ./bin/run.js [COMMAND] --help`
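The oclif wiring lives in `package.json`. A sketch of the relevant section follows; the field values here are assumptions based on the behavior described above (bin name `app`, compiled commands in `dist/commands`), so check the actual file:

```json
{
  "bin": {
    "app": "./bin/run.js"
  },
  "oclif": {
    "bin": "app",
    "commands": "./dist/commands"
  }
}
```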
The `scan` command analyzes a list of websites for Prebid.js integrations and other specified ad technology libraries. It processes URLs from an input source, launches Puppeteer instances to visit each page, and collects data, saving the results to an output directory.

Syntax:

./bin/run scan [INPUTFILE] [FLAGS...]
# or in development: node ./bin/dev.js scan [INPUTFILE] [FLAGS...]
# or using npm script: npm run prebid:scan -- [INPUTFILE] [FLAGS...] (note the -- to pass arguments to the script)
Argument:

- `INPUTFILE`: (Optional) Path to an input file containing a list of URLs to scan (one URL per line).
  - This argument is required if `--githubRepo` is not used.
  - If `--githubRepo` is provided, `INPUTFILE` is ignored.
  - If neither `INPUTFILE` nor `--githubRepo` is specified, the command shows an error.
  - Previously this defaulted to `src/input.txt`; now an input source must be explicitly provided. (Note: the CLI was updated in `scan.ts` to require either `INPUTFILE` or `--githubRepo` explicitly.)
Flags:

- `--githubRepo <URL>`: Specifies a public GitHub URL from which to fetch URLs.
  - This can be a base repository URL (e.g., `https://github.com/owner/repo`) to scan for URLs within processable files (such as `.txt` and `.md`) in the repository root.
  - Alternatively, it can be a direct link to a specific processable file within a repository (e.g., `https://github.com/owner/repo/blob/main/some/path/file.txt`). In this case, only the specified file is fetched and processed.
  - Example (repository): `--githubRepo https://github.com/owner/repo`
  - Example (direct file): `--githubRepo https://github.com/owner/repo/blob/main/urls.txt`
- `--csvFile <path_or_url>`: Path to a local CSV file or a URL (e.g., a raw GitHub CSV link) from which to load URLs for scanning. The scanner expects URLs in the first column of the CSV. This flag takes precedence over `--githubRepo` and the `INPUTFILE` argument.
  - Example (local): `--csvFile ./path/to/your/urls.csv`
  - Example (remote): `--csvFile https://raw.githubusercontent.com/user/repo/main/path/to/your/urls.csv`
- `--numUrls <number>`: When used with `--githubRepo`, limits the number of URLs extracted and processed from the repository. (Note: this flag does not currently apply to URLs loaded via `--csvFile`.)
  - Default: `100`
  - Example: `--numUrls 50`
- `--puppeteerType <option>`: Specifies the Puppeteer operational mode.
  - Options: `vanilla`, `cluster` (default)
  - `vanilla`: Processes URLs sequentially using a single Puppeteer browser instance.
  - `cluster`: Uses `puppeteer-cluster` to process URLs in parallel, according to the concurrency settings.
- `--concurrency <value>`: Sets the number of concurrent Puppeteer instances when using `puppeteerType=cluster`.
  - Default: `5`
- `--headless`: Runs Puppeteer in headless mode (no UI). This is the default.
- `--no-headless`: Runs Puppeteer with a visible browser UI.
- `--monitor`: Enables `puppeteer-cluster`'s web monitoring interface (available at `http://localhost:21337` by default) when `puppeteerType=cluster`.
  - Default: `false`
- `--outputDir <value>`: Specifies the directory where scan results (JSON files) are saved.
  - Default: `output` (in the project root)
  - Results are typically saved in a subdirectory structure like `output/Month/YYYY-MM-DD.json`.
- `--logDir <value>`: Specifies the directory where log files (`app.log`, `error.log`) are saved.
  - Default: `logs` (in the project root)
- `--range <string>`: Specifies a line range (e.g., `10-20`, `5-`, `-15`) to process from the input source (file, CSV, or GitHub-extracted URLs), using 1-based indexing. If the source is a GitHub repo, the range applies to the aggregated list of URLs extracted from all targeted files in the repo.
  - Example: `--range 10-50` or `--range 1-`
- `--chunkSize <number>`: Processes URLs in chunks of this size. All URLs (whether from the full input or a specified range) are still processed, but only `chunkSize` URLs are loaded and analyzed at a time. Useful for very large URL lists to manage resources or process incrementally.
  - Example: `--chunkSize 50`
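The `--range` and `--chunkSize` behaviors described above can be sketched as follows. This is a simplified illustration of the semantics, not the tool's actual implementation; `parseRange` and `chunk` are hypothetical helper names:

```typescript
// Parse a 1-based range spec like '10-20', '5-', or '-15' against a list length.
// Returns [start, end] as 1-based inclusive bounds.
function parseRange(spec: string, total: number): [number, number] {
  const [startPart, endPart] = spec.split('-');
  const start = startPart === '' ? 1 : parseInt(startPart, 10);
  const end = endPart === '' ? total : parseInt(endPart, 10);
  if (isNaN(start) || isNaN(end) || start < 1 || start > end) {
    throw new Error(`Invalid range: ${spec}`);
  }
  return [start, Math.min(end, total)];
}

// Split a list into chunks of at most `size` items, to be processed one at a time.
function chunk<T>(items: T[], size: number): T[][] {
  const out: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    out.push(items.slice(i, i + size));
  }
  return out;
}

// Example: take lines 3-8 of a 10-URL list, then process in chunks of 2.
const urls = Array.from({ length: 10 }, (_, i) => `https://site${i + 1}.example`);
const [start, end] = parseRange('3-8', urls.length);
const selected = urls.slice(start - 1, end); // convert 1-based bounds to a 0-based slice
const batches = chunk(selected, 2);
```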
Usage Examples:

- Basic scan (using default `input.txt` and cluster mode):

  ./bin/run scan

  (Ensure `./bin/run` has execute permissions or use `node ./bin/run scan`.)

- Scan using vanilla Puppeteer:

  ./bin/run scan --puppeteerType=vanilla

- Scan with a specific input file and output directory:

  ./bin/run scan my_urls.txt --outputDir=./my_results

- Scan in non-headless (headed) mode:

  ./bin/run scan --no-headless

- Scan with increased concurrency and monitoring for cluster mode:

  ./bin/run scan --concurrency=10 --monitor

- Scan URLs from a GitHub repository:

  ./bin/run scan --githubRepo https://github.com/owner/repo

- Scan a limited number of URLs from a GitHub repository:

  ./bin/run scan --githubRepo https://github.com/owner/repo --numUrls 50

- Scan URLs from a local CSV file:

  ./bin/run scan --csvFile ./data/urls_to_scan.csv

- Scan URLs from a remote CSV file (raw GitHub link):

  ./bin/run scan --csvFile https://raw.githubusercontent.com/prebid/prebid-integration-monitor/main/tests/fixtures/sample_urls.csv

- Scan a specific range of URLs from a large input file, in chunks:

  ./bin/run scan very_large_list_of_sites.txt --range 1001-2000 --chunkSize 100
- The scanner only processes URLs that begin with `http://` or `https://`.
- Entries in input sources (files or GitHub content) that are malformed (e.g., `htp://missing-t.com`), schemeless (e.g., `example.com` without a leading `http://` or `https://`), or plain text are skipped and not processed as URLs.
- The tool is designed to robustly process valid URLs even when they are mixed with such non-URL or malformed entries in the source content.
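The filtering rules above can be illustrated with a small sketch. This approximates the described behavior and is not the tool's actual code:

```typescript
// Keep only entries that start with a well-formed http:// or https:// scheme.
function extractUrls(lines: string[]): string[] {
  return lines
    .map((line) => line.trim())
    .filter((line) => /^https?:\/\//.test(line));
}

const source = [
  'https://example.com',     // kept
  'http://another-site.org', // kept
  'htp://missing-t.com',     // skipped: malformed scheme
  'example.com',             // skipped: schemeless
  'just some plain text',    // skipped: not a URL
];
const validUrls = extractUrls(source);
```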
The `stats:generate` command processes stored website scan data (typically found in the `./store` directory, generated by the `scan` command) to generate or update the `api/api.json` file. This JSON file contains aggregated statistics about Prebid.js usage, including version distributions and module usage, after cleaning and categorization.

Syntax:

./bin/run stats:generate
# or in development: node ./bin/dev.js stats:generate
# or using npm script (if configured): npm run prebid:stats:generate
Flags:
This command currently does not take any specific flags.
Usage Example:

- Generate or update the statistics API file:

  ./bin/run stats:generate

  (Ensure `./bin/run` has execute permissions or use `node ./bin/run stats:generate`.)
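The kind of aggregation `stats:generate` performs can be sketched as follows. The `ScanRecord` shape and field names here are assumptions for illustration; the real files in `./store` may differ:

```typescript
// Hypothetical shape of a stored scan record.
interface ScanRecord {
  url: string;
  prebid?: { version: string; modules: string[] };
}

// Roll up version and module counts, the kind of summary api/api.json holds.
function aggregate(records: ScanRecord[]) {
  const versionDistribution: Record<string, number> = {};
  const moduleDistribution: Record<string, number> = {};
  for (const r of records) {
    if (!r.prebid) continue; // skip sites without a Prebid.js integration
    versionDistribution[r.prebid.version] =
      (versionDistribution[r.prebid.version] ?? 0) + 1;
    for (const m of r.prebid.modules) {
      moduleDistribution[m] = (moduleDistribution[m] ?? 0) + 1;
    }
  }
  return { versionDistribution, moduleDistribution };
}

const stats = aggregate([
  { url: 'https://a.example', prebid: { version: '8.52.0', modules: ['rubiconBidAdapter'] } },
  { url: 'https://b.example', prebid: { version: '8.52.0', modules: ['appnexusBidAdapter'] } },
  { url: 'https://c.example' }, // no Prebid.js found; excluded from the rollup
]);
```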
To run the test suite using Vitest:

npm test

Note on `vitest` execution: In some environments, if `vitest` is not found via the npm script, you may need to run it using `npx`:

npx -p vitest vitest run

This executes all tests found in the `tests/` directory.
This application utilizes the Winston logging library to provide detailed and structured logging. Log files are stored in the `logs/` directory, which is excluded from Git commits via `.gitignore`.

- `logs/app.log`: Contains all general application logs, including informational messages, warnings, and errors (typically `info` level and above). Entries are stored in JSON format, allowing for easy parsing and querying; each includes a timestamp, log level, message, and any additional metadata.
- `logs/error.log`: Dedicated to error-level logs only. It provides a focused view of errors that have occurred within the application, also in JSON format, and includes stack traces when available.
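Because both files store one JSON object per line, they are easy to post-process. A small sketch (the exact field names in the real log files are assumptions based on the format described in this section):

```typescript
// Each log line is a JSON object with at least timestamp, level, and message.
interface LogEntry {
  timestamp: string;
  level: string;
  message: string;
  [meta: string]: unknown;
}

// Parse newline-delimited JSON log content into entries.
function parseLogLines(raw: string): LogEntry[] {
  return raw
    .split('\n')
    .filter((line) => line.trim() !== '')
    .map((line) => JSON.parse(line) as LogEntry);
}

const sample = [
  '{"timestamp":"2024-05-01 10:00:00","level":"info","message":"scan started"}',
  '{"timestamp":"2024-05-01 10:00:05","level":"error","message":"page load failed","stack":"Error: ..."}',
].join('\n');

const entries = parseLogLines(sample);
const errors = entries.filter((e) => e.level === 'error');
```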
In addition to file logging, messages are also output to the console:

- Log messages are colorized by severity level for better readability (e.g., errors in red, warnings in yellow).
- The console typically displays `info` level messages and above.
- The format includes the timestamp, log level, and message, similar to the file logs.

All log entries, whether in files or on the console, follow a consistent format:

- Timestamp: `YYYY-MM-DD HH:mm:ss`
- Level: The severity of the log (e.g., `info`, `warn`, `error`).
- Message: The main content of the log entry.
- Stack Trace: For error logs, a stack trace is included if available, aiding in debugging.
OpenTelemetry has been integrated to provide distributed tracing capabilities.

- The main tracer configuration can be found in `src/tracer.ts`.
- Log messages (both console and file) are now automatically enriched with `trace_id` and `span_id` when generated within an active trace.
- The default OTLP HTTP exporter is used. You may need to set the `OTEL_EXPORTER_OTLP_ENDPOINT` environment variable to point to your OpenTelemetry collector (e.g., `http://localhost:4318/v1/traces`).
- The service name for OpenTelemetry is configured via the `OTEL_SERVICE_NAME` environment variable (e.g., `export OTEL_SERVICE_NAME="prebid-integration-monitor"`). An attempt to set this directly in `src/tracer.ts` using the `Resource` attribute resulted in a TypeScript build error (TS2693) due to OpenTelemetry package version incompatibilities, so it remains commented out.