Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Drive IO stats #891

Open
wants to merge 7 commits into
base: master
Choose a base branch
from
Open

Drive IO stats #891

wants to merge 7 commits into from

Conversation

r-scheele
Copy link

Implemented new metrics in publishDriveStats:

  • Online/Offline status of drives.
  • Total bytes read and written.
  • Read and write latency calculations.
  • Read and write throughput.
  • Drive wait time.
    Similar to the existing implementation publishVolumeStats

Copy link
Collaborator

@Praveenrajmani Praveenrajmani left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you fix the lint errors? @r-scheele

pkg/metrics/collector.go Outdated Show resolved Hide resolved
return &stats, nil
}

func getSectorSize(device string) int64 {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let this function return an (error, int64)

func (c *metricsCollector) publishDriveStats(ctx context.Context, drive *types.Drive, ch chan<- prometheus.Metric) {
device, err := c.getDeviceByFSUUID(drive.Status.FSUUID)
if err != nil {
klog.ErrorS(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the device is not found by its FSUUID, this means the drive is not READY. we should also export this information as a metric.

// Drive is offline
ch <- prometheus.MustNewConstMetric(
prometheus.NewDesc(
prometheus.BuildFQName(consts.AppName, "stats", "drive_status"),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we have _drive_ready instead of _drive_status?

if these directories are not present, then it means the drive is "not ok" (or) "not ready". we can have a label name accordingly.

Copy link
Member

@balamurugana balamurugana left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add respective functional tests

@@ -11,3 +11,4 @@ dist/
kustomize
operator-sdk
kubectl-directpv_*
directpv.tar
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As none of our build tool creates this file, could you explain why do we need to ignore this file?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it was added to the .gitignore because it is generated during testing.
https://github.com/minio/directpv/blob/master/docs/developer.md the image tar ball was created by podman

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add a new line

ReadTicks int64
WriteSectors int64
WriteTicks int64
TimeInQueue int64
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. Add a function in pkg/device/sysfs_linux.go like below
func GetStat(name string) (stats []uint64, err error) {
	line, err := readFirstLine("/sys/class/block/" + name + "/stat")
	if err != nil {
		return nil, err
	}

	for _, token := range strings.Split(line, " ") {
		token = strings.TrimSpace(token)
		ui64, err := strconv.ParseUint(token, 10, 64)
		if err != nil {
			return nil, err
		}
		stats = append(stats, ui64)
	}

	return stats, nil
}

and use it here like

func getDriveStats(driveName string) (*driveStats, error) {
	stats, err := device.GetStat(driveName)
	if err != nil {
		return nil, err
	}
	if len(stats) == 0 {
		// The drive is lost
		return nil, nil
	}

	if len(stats) < 10 {
		return nil, fmt.Errorf("unknown stats from sysfs for drive %v", driveName)
	}

	// Refer https://www.kernel.org/doc/Documentation/block/stat.txt
	// for meaning of each field.
	return &driveStats{
		ReadSectors:  stats[2],
		ReadTicks:    stats[3],
		WriteSectors: stats[6],
		WriteTicks:   stats[7],
		TimeInQueue:  stats[9],
	}, nil
}
  1. unexport/make private all the fields in driveStats struct

"Drive Online/Offline Status",
[]string{"drive"}, nil),
prometheus.GaugeValue,
0, // 0 for offline
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mean boolean value here?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I get you correctly, I'm using 0 to show offline and 1 to show online. As it seems more Boolean-like. Prometheus does not support a native Boolean data type.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the other way to do to indicate boolean state? If none, default constants for drive online/offline and user it here.


return
}
device = strings.TrimPrefix(device, "/dev")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

}
device = strings.TrimPrefix(device, "/dev")
// Online/Offline Status
if _, err := os.Stat("/sys/block" + device); os.IsNotExist(err) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. As mentioned previously, this won't work for disk partition used as DirectPVDrive. You have to use /sys/class/block instead of /sys/block for both disk and partition
  2. Use path.Join() instead of +
  3. sysfs is unreliable to check drive existent on the system i.e. in some systems sysfs entry is present but actual drive is gone from the system

)
}

filePath := "/sys/block" + device + "/stat"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use path.Join()


func getSectorSize(device string) (int64, error) {
// Construct the file path to the sector size file
filePath := "/sys/block/" + device + "/queue/hw_sector_size"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. This file doesn't exist for disk partitions. For partitions, find it is disk and fetch sector size from it and use it.
  2. if /sys/class/block/<device>/queue/hw_sector_size is missing for disk, use 512 as default block size.
  3. Move this function as GetHardwareSectorSize to pkg/device/sysfs_linux.go

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found a defaultBlockSize as 512 in the sysfs_linux.go, Since it's the same value, should I go ahead and use it? but I'm thinking block size and sector size aren't the same thing, I could create a new constant just for sector size.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Most of the system have default block/sector size as 512. you could define default sector size as default block size.

if result.Err != nil {
return
continue
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why continue? do you mean break?

List(ctx)
for result := range driveResultCh {
if result.Err != nil {
continue
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why continue? do you mean break?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I used continue to ensure we process all drives, even if an error occurs on one of them.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Once the error is thrown, we cannot get values further. break is logically correct thing to do.

@r-scheele
Copy link
Author

Please add respective functional tests

Please add respective functional tests

Yes i will, I'm going through the previous one written already.

@Praveenrajmani
Copy link
Collaborator

PTAL @r-scheele

@r-scheele
Copy link
Author

PTAL @r-scheele

Hi @Praveenrajmani yes, i'm working on this review now

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants