Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix using unavailable worker cached in client-pool to resolve #18693 #18715

Open
wants to merge 1,157 commits into
base: main
Choose a base branch
from

Conversation

yws-tracy
Copy link

What changes are proposed in this pull request?
to fix #18693

Why are the changes needed?
In ClientPool, ClientPoolKey only takes ip into consideration, which is not enough, it can cause incorrect worker cache in clientPool. To fix the problem, hostName should be taken into consideration to compare ClientPoolKey as well

Does this PR introduce any user facing changes?
No

Xenorith and others added 30 commits September 10, 2023 22:26
### What changes are proposed in this pull request?

Remove hostname from metrics key

### Why are the changes needed?

For easy aggregation on prometheus and grafana side

### Does this PR introduce any user facing changes?

Add a flag to disable this for compatibility

			pr-link: Alluxio#18121
			change-id: cid-ba6c2f9fae625747192044168fce7dc026c66b9c
besides the User-CLI.md doc, update other doc files that refer to `bin/alluxio` commands
- remove docs on path config
- remove starting/stopping job master/worker from contributor docs
			pr-link: Alluxio#18128
			change-id: cid-fc71dd493b16ef3aeeb0b1b190941c43b9af9cab
What changes are proposed in this pull request?
I have created a test and create a liststatus test for its function.

Why are the changes needed?
Please clarify why the changes are needed. For instance,

add a unit test for DoraWorkerClientServiceHandler.
Does this PR introduce any user facing changes?
No.
			pr-link: Alluxio#18059
			change-id: cid-b82706a4419700f017584f3e5579d2ef3410aeb3
Make fuse max reader concurrency configurable. The default value was 64 and it was unchangeable.
			pr-link: Alluxio#18129
			change-id: cid-9c55821622329bd1e608da2e7445e8ab591df38a
Fix typo from alluxio.max.fuse.reader.concurrency to alluxio.fuse.max.reader.concurrency
			pr-link: Alluxio#18134
			change-id: cid-434086cf6ba9e9f8d173e3417fc8518963dfa102
update usages of bin/alluxio, bin/alluxio-start.sh and bin/alluxio-stop.sh to their new counterparts

simplify section of CephFS.md and remove sections related to mounting. the ufs must be configured as the root mount via alluxio-site.properties.
			pr-link: Alluxio#18136
			change-id: cid-fa7d0eec00c8fb136680ef6d5a2c7ee78571d123
### What changes are proposed in this pull request?

Support accessing OSS through proxy by configuring alluxio properties or system properties.

### Why are the changes needed?

When accessing OSS through a proxy, the OSS client cannot recognize the proxy configuration in system property and environment variables. So it has to proactively set proxy-related configurations in the configuration.

### Does this PR introduce any user facing changes?

Please list the user-facing changes introduced by your change, including
  1. change in user-facing APIs
  2. addition or removal of property keys
  3. webui

			pr-link: Alluxio#18139
			change-id: cid-5e30dfd90747d4a1aafe9b2ff985331f05fefec6
### What changes are proposed in this pull request?

If don't set oss.proxy.host, the default value should be NULL

### Why are the changes needed?

If don't set oss.proxy.host, the default value should be NULL

### Does this PR introduce any user facing changes?

Please list the user-facing changes introduced by your change, including
  1. change in user-facing APIs
  2. addition or removal of property keys
  3. webui

			pr-link: Alluxio#18142
			change-id: cid-bce2790e583445c4ba6720d2f0a64551fb19de20
### What changes are proposed in this pull request?

In this change the path conversion logic is extracted to static utility methods for code reuse (because other classes may use the same path resolution logic). 

The method names are slightly improved, to distinguish the member methods in `DoraCacheFileSystem` (may be inherited) from the static utility methods.
			pr-link: Alluxio#18140
			change-id: cid-557fa148f6daa41f0b296132e5a2ecae6c5d6c22
			pr-link: Alluxio#18138
			change-id: cid-e9f862385bdfe6cc5e6938eb49907055449deb4a
### What changes are proposed in this pull request?
1. Move the path resolution logic from `DoraCacheFileSystem` to `PathUtils` where it makes more sense
2. Fix the alluxioPathToUfsPath resolution by handling the case where the ufs path may have no matching alluxio path, making the util method more generic
			pr-link: Alluxio#18146
			change-id: cid-ebace1efcf58e385bbf71b599e4b5a15a2199f7e
### What changes are proposed in this pull request?

Please outline the changes and how this PR fixes the issue.

### Why are the changes needed?

Better error handling when a stream is closed. Log more and don't swallow errors

### Does this PR introduce any user facing changes?

N/A
			pr-link: Alluxio#18145
			change-id: cid-b1146421ea5bc51e9a5eea5bf11ce1a5d8466912
Use jackson JSON as a standard format for reports.

Output example for `bin/alluxio info report summary`:
```
{
    "mSafeMode":false,
    "mZookeeper":false,
    "mRaftJournal":true,
    "version":"304-SNAPSHOT",
    "uptime":"0 day(s), 0 hour(s), 3 minute(s), and 7 second(s)",
    "rpcPort":19998,
    "webPort":19999,
    "masterAddress":"Ec2Cluster-masters-0:19998",
    "masterVersions":[
        {
            "version":"304-SNAPSHOT",
            "state":"PRIMARY",
            "host":"Ec2Cluster-masters-0",
            "port":19998
        }
    ],
    "started":"08-24-2023 06:43:32:856",
    "zookeeperAddress":[

    ],
    "raftJournalAddress":[
        "Ec2Cluster-masters-0:19200"
    ],
    "liveWorkers":2,
    "lostWorkers":0,
    "freeCapacity":"2048.00MB",
    "totalCapacityOnTiers":{
        "MEM":"2048.00MB"
    },
    "usedCapacityOnTiers":{
        "MEM":"0B"
    }
}
```
			pr-link: Alluxio#18047
			change-id: cid-bf6d54f47390a4d2bd84e4baac2ea2863d4638e1
### What changes are proposed in this pull request?
refactor the cli code

### Why are the changes needed?
make the code more flexible and easy for adding more functions in a cleaner way

### Does this PR introduce any user facing changes?
nope

			pr-link: Alluxio#18152
			change-id: cid-d8f937075174d913daf32387d781161096f03345
1. Use `addSuppressed` instead of creating a new exception to throw the original exception.
2. Go ahead to write data to UFS when it encountered exception during the time writing data to Alluxio Worker.

			pr-link: Alluxio#18017
			change-id: cid-9337252b71e40fced28fb1598ea88eed56c69229
### What changes are proposed in this pull request?

Improve distributed load
1. Configurable job failure criteria 
2. Configuration to determine if the load job should be restored from journal or not
3. Add an option to skip existing fully loaded file 
4. Add retry count for failed files 
5. Bug fixing

### Why are the changes needed?

To enhance the distributed load tool

### Does this PR introduce any user facing changes?

Yes. The skip-if-exists option is added to the distributed load cli.
			pr-link: Alluxio#18153
			change-id: cid-5644da1c09bd6ee48f628552f51cb570de581b93
### What changes are proposed in this pull request?

Recover the ufs uri support

### Why are the changes needed?

Ufs uri should be the first class citizen in dora

### Does this PR introduce any user facing changes?

No

			pr-link: Alluxio#18135
			change-id: cid-a35fe39bdf69879ef113fc97737a35ebf6d8b29a
Add docker doc back
			pr-link: Alluxio#18130
			change-id: cid-6038dce4ae1821f0e7ccf0e2e874bed5d312057d
### What changes are proposed in this pull request?

Add a test so we can monitor the vnode distribution is not too uneven. 

This test calculates the standard deviation over mean on the collection of virtual nodes assigned to physical nodes. It arbitrarily bounds it at 0.25, but ideally this number should get smaller over time as we improve hashing algorithm
and use better ways to assign virtual nodes to physical nodes.


### Why are the changes needed?

We may change hashing algorithm and virtual node assignment in the future, this will provide guidance and catch errors. 

### Does this PR introduce any user facing changes?
No. 

			pr-link: Alluxio#18147
			change-id: cid-152d8edc9b65ef59967d5985849feeb471a6650d
Add the following CLI tools for debugging and analyzing caching issues:
1. checkCaching. Checks if files under a path have been cached in alluxio.
2. location. Displays the list of hosts storing the specified file.
3. consistentHash. This command is for checking whether the consistent hash ring is changed or not.
			pr-link: Alluxio#18151
			change-id: cid-c89b98da70a5270070d873bdcfce1aa9b23cf083
Update the Golang side commands to be able to use this output:
1. Return either the yaml or json (default) output to the console.
2. Users can define the format they want with `--format` flag, like `bin/alluxio info report --format yaml`
3. In JSON format, print properties in a fixed, easy-to-read order
4. In YAML format, print properties alphabetically (since YAML specification regards property order non-significant)

Before:
```
{"safeMode":false,"masterVersions":[{"version":"304-SNAPSHOT","host":"localhost","port":19998,"state":"PRIMARY"}],"masterAddress":"localhost:19998","zookeeperAddress":[],"useZookeeper":false,"raftJournalAddress":["localhost:19200"],"useRaftJournal":true,"liveWorkers":1,"lostWorkers":0,"freeCapacity":"1024.00MB","totalCapacityOnTiers":{"MEM":"1024.00MB"},"usedCapacityOnTiers":{"MEM":"0B"},"version":"304-SNAPSHOT","webPort":19999,"started":"09-15-2023 15:54:56:635","uptime":"0 day(s), 0 hour(s), 26 minute(s), and 37 second(s)","rpcPort":19998}
```

After (in JSON):
```
{
    "rpcPort": 19998,
    "started": "09-15-2023 15:54:56:635",
    "uptime": "0 day(s), 0 hour(s), 55 minute(s), and 31 second(s)",
    "safeMode": false,
    "version": "304-SNAPSHOT",
    "webPort": 19999,
    "masterVersions": [
        {
            "version": "304-SNAPSHOT",
            "host": "localhost",
            "port": 19998,
            "state": "PRIMARY"
        }
    ],
    "masterAddress": "localhost:19998",
    "zookeeperAddress": [],
    "useZookeeper": false,
    "raftJournalAddress": [
        "localhost:19200"
    ],
    "useRaftJournal": true,
    "liveWorkers": 1,
    "lostWorkers": 0,
    "freeCapacity": "1024.00MB",
    "totalCapacityOnTiers": {
        "MEM": "1024.00MB"
    },
    "usedCapacityOnTiers": {
        "MEM": "0B"
    }
}
```

After (in YAML):
```
freeCapacity: 1024.00MB
liveWorkers: 1
lostWorkers: 0
masterAddress: localhost:19998
masterVersions:
    - host: localhost
      port: 19998
      state: PRIMARY
      version: 304-SNAPSHOT
raftJournalAddress:
    - localhost:19200
rpcPort: 19998
safeMode: false
started: 09-15-2023 15:54:56:635
totalCapacityOnTiers:
    MEM: 1024.00MB
uptime: 0 day(s), 1 hour(s), 1 minute(s), and 36 second(s)
useRaftJournal: true
useZookeeper: false
usedCapacityOnTiers:
    MEM: 0B
version: 304-SNAPSHOT
webPort: 19999
zookeeperAddress: []
```
			pr-link: Alluxio#18159
			change-id: cid-deb6e74552de9afcf45391c6c230a9fe00785e37
### What changes are proposed in this pull request?

Add datePredicate, i.e.:
lastModifiedDate(2000/01/01, 2023/09/01)

### Why are the changes needed?

Customer requirement.

### Does this PR introduce any user facing changes?

na

			pr-link: Alluxio#18155
			change-id: cid-7e1a7b7d208747807b87502c9da854ddf0b8c7fc
For random reads, bytes read per file is not a constant any more. In spite of existing duration distribution, need a throughput distribution for better understanding of reading performance.

Also, when duration too long, grpc will receive huge size of output data. Should aggregate data points to transfer more datapoints with limited output size.

Group datapoints by threads and time slices:
Example:
```
nodeResults: {
  worker-0: {
    dataPoints: [
      data: [
        { // worker 0, thread 0, slice 0
          count: 1,
          iobytes: 33554432,
        }, { // worker 0, thread 0, other slices
          …
        }
      ], [ // worker 0, other workers
        …
      ]
    ]
    throughputPercentiles: […]
  },
  worker-1: { // other workers
    …
  }
}
```

Slice time with `--slice-size` flag, e.g. `--slice-size 1s`.
			pr-link: Alluxio#18149
			change-id: cid-ec8ed5a4f9eeaa86b1d86b6b449db4647d584823
### What changes are proposed in this pull request?

Add Unit Test for OSS, OBS and GCS

### Why are the changes needed?

Unit test is important for improving functions of Alluxio.

### Does this PR introduce any user facing changes?

No

			pr-link: Alluxio#17985
			change-id: cid-c757b8249c62e2ccf0483cb99436e33d351358a1
### What changes are proposed in this pull request?

Fix datePredicate so it would respect the interval specified in the policy.
Polish tests
### Why are the changes needed?

bug fix

### Does this PR introduce any user facing changes?
na

			pr-link: Alluxio#18167
			change-id: cid-b035b5cc31aa70a2f83de2c3f84ba49ed75f9fb5
- Add a new CLI command that iterates through the command tree and generates a markdown file based on each command's definition and description
- Migrate all the content in the previous User-CLI.md into the corresponding commands in golang code, mainly updating their `Long` and `Examples` fields
- Run `bin/alluxio generate user-cli` to write the generated content directly into `docs/en/operation/User-CLI.md`
			pr-link: Alluxio#18144
			change-id: cid-9b29f273efef9693e1b0b303c62cc19602d77acc
### What changes are proposed in this pull request?

Fix the broken PagedDoraWorkerTest.

### Why are the changes needed?

The old test is.broken, I just fix it.

### Does this PR introduce any user facing changes?

no.

			pr-link: Alluxio#18150
			change-id: cid-a142297c4f08780189e4321abc8e99ff512091ec
jja725 and others added 28 commits February 15, 2024 14:47
### What changes are proposed in this pull request?

2000 is too much

### Why are the changes needed?

2000 is too much

### Does this PR introduce any user facing changes?

na

			pr-link: Alluxio#18516
			change-id: cid-98762f4a176f30b7a83399183aef6f11d5113132
### What changes are proposed in this pull request?

Throw a PageCorruptedException when the length of page inconsistent with the metadata
Do our best efforts to delete the corrupted page file
Reset the offset of buffer when we found the data has been corrupted to avoid ArrayOutOfBound exception. 

### Why are the changes needed?
We found the cache make presto keep failing when some of the page file got corrupted

### Does this PR introduce any user facing changes?

No

			pr-link: Alluxio#18498
			change-id: cid-9012c5432e2b979f8242f3b247deed94b501d194
### What changes are proposed in this pull request?
Fix runtime stats when cache hit

### Why are the changes needed?

We already count the runtime stats in local cache manager, but we didn't pass in a proper cache context

### Does this PR introduce any user facing changes?

Please list the user-facing changes introduced by your change, including
  1. change in user-facing APIs
  2. addition or removal of property keys
  3. webui

			pr-link: Alluxio#18503
			change-id: cid-233916f6b410432f8300f10ac8a2385afe969f2c
Fix Unrecognized error

### What changes are proposed in this pull request?

Please outline the changes and how this PR fixes the issue.

### Why are the changes needed?

Please clarify why the changes are needed. For instance,
  1. If you propose a new API, clarify the use case for a new API.
  2. If you fix a bug, describe the bug.

### Does this PR introduce any user facing changes?

Please list the user-facing changes introduced by your change, including
  1. change in user-facing APIs
  2. addition or removal of property keys
  3. webui

			pr-link: Alluxio#18527
			change-id: cid-9753dc51317f4ebf4ba50cd5463155e821191ac3
### What changes are proposed in this pull request?

Add decommission capability for worker and removenode cli cmd to remove node from etcd registration

### Why are the changes needed?

to add decommission capability for worker

### Does this PR introduce any user facing changes?

New cli added for remove worker node

			pr-link: Alluxio#18530
			change-id: cid-0d41896ef29c976d773a4dfa0b07bd5efd836115
### What changes are proposed in this pull request?

Port changes from  [@LetianYuan](https://github.com/LetianYuan) in PR : Alluxio#17256
in urgency for resolving injection vulnerability.

### Why are the changes needed?

Resolve issue:  https://nvd.nist.gov/vuln/detail/CVE-2023-38889

### Does this PR introduce any user facing changes?

No.

			pr-link: Alluxio#18536
			change-id: cid-a8122e3ff3324f1a9d881f55cc425ab30a020c35
### What changes are proposed in this pull request?

rebuild libjnifuse on Centos7 (glibc 2.17, libfuse 2.9.2) 
the build should work on Ubuntu with glibc 2.3x 

### Why are the changes needed?

libjnifuse.so fails to link to libfuse2, issue created here : Alluxio#18513
Introduced by a jnifuse change to rebuild the libjnifuse.so 


### Does this PR introduce any user facing changes?

fuse2 should be able to use now

			pr-link: Alluxio#18537
			change-id: cid-d5bd6421ccd76f59e6e1216ac9940c419c3c6a45
### What changes are proposed in this pull request?

Alluxio doesn't work with early-access (EA) JDKs which makes it hard to do some forward-testing before new JDK is released.

### Why are the changes needed?

Please clarify why the changes are needed. For instance,
  1. If you propose a new API, clarify the use case for a new API.
  2. If you fix a bug, describe the bug.

### Does this PR introduce any user facing changes?

Please list the user-facing changes introduced by your change, including
  1. change in user-facing APIs
  2. addition or removal of property keys
  3. webui

			pr-link: Alluxio#18553
			change-id: cid-15c97cfab92f5ceafbc9051550cf596399ba02b9
### What changes are proposed in this pull request?
1. catch all exception thrown by fetch worker cluster view from etcd …when refresh the workers view instead of propagate throwing all the way to the caller
2. make worker failure detection timeout (the timeout to determine if a worker is in FAILED state) configurable thru setting the service discovery entity's lease ttl
3. make newLeaseInternal always overwrite the key with the newly created lease

### Why are the changes needed?

To resolve:
1) currently worker is considered down only 2sec after its disconnection with etcd, its too small, make the failure detection timeout configurable for registered service discovery services.
2) etcd unavaible runtime exception will be propagated to caller which is non-ideal, currently capture at FileSystemContext.getCachedWorkers layer to prevent propagating to IO layer, causing an unnecessary ufs fallback such as cold read.

### Does this PR introduce any user facing changes?

No

			pr-link: Alluxio#18554
			change-id: cid-3e87779f7cce10861e5b8a871030a89f9b828ab1
### What changes are proposed in this pull request?
When a old worker is down for couple days and up again with old data, the metrics Client.CachePagesAges would show high page age which exceed the ttl. And this violates the regulation.
This is caused by: 
1. we are not deleting old files when restoring local cache, and old data may still be able to serve users. 
2. when we are not able to run invalidate scheduled tasks when we are in restore state(initialDelay = 0).

In the meanwhile, fix a metrics calculation: CLIENT_CACHE_PAGES_INVALIDATED

### Why are the changes needed?
bug fix

### Does this PR introduce any user facing changes?
na

			pr-link: Alluxio#18568
			change-id: cid-c335697a018e0d57182045e96245f1145ed2494c
### What changes are proposed in this pull request?

Using name threads is generally a good practice and helps system administrators understand what's going on with the system. Without explicit naming, the JDK-provided pools jave threads named `pool-N-thread-T`.


			pr-link: Alluxio#18573
			change-id: cid-2fb9e71b75de62e7c6a0ead8c83037922957bf68
### What changes are proposed in this pull request?

Modified the logic of Alluxio clearing expired metadata.
The expiration time of RocksDB is set so that expired metadata can be cleared in time every time RocksDB performs a compaction operation.

### Why are the changes needed?

Alluxio has a problem that expired metadata cannot be cleared in time, which may lead to insufficient disk space in the long run.
We have modified the logic for clearing expired metadata to ensure that if the user sets an appropriate metadata expiration time, the disk will not continue to accumulate metadata that cannot be cleared.

### Does this PR introduce any user facing changes?

None

			pr-link: Alluxio#18582
			change-id: cid-40c506311a76a3c7585227ca2b5f488088ce9486
### What changes are proposed in this pull request?

Add HTTP getPage cache hit ratio. This is used for mimicing how much data read from alluxio cache and how much data possibly fallback to UFS given that no that much user get metrics from alluxiofs client.

### Why are the changes needed?

To get cache hit ratio of worker http server get page

### Does this PR introduce any user facing changes?

Yes, add some metrics

			pr-link: Alluxio#18594
			change-id: cid-e6ee6b8f2d5a751d72e804fe8e067b02cbf3ac2c
Install netcat in Dockerfile as needed.
			pr-link: Alluxio#18598
			change-id: cid-d2cc6b7c6d485a526e038c2faa94830933d3f716
Add a path to the worker http server to enable an easy health check.

			pr-link: Alluxio#18596
			change-id: cid-e12e3a2d1f91152f0af5cb598ba49c65a8abe42c
Add health check RESTful API in Worker HTTP Server.
			pr-link: Alluxio#18604
			change-id: cid-9a2a13e98df8e6f5c7298e4cd32af2747dc365da
### What changes are proposed in this pull request?

Fix metrics when delete non existing page

### Why are the changes needed?
When deleting non-existing page we see a abnormal spike in PageAge. Still doesn't know why we have abnormal PageAge but since it's non-existing page we would not able to read it as well so it make sense to not record the age.
Also we continue to delete page from page store if we didn't find it in metastore so we avoid inconsistency between pagestore and metastore

### Does this PR introduce any user facing changes?
na

			pr-link: Alluxio#18618
			change-id: cid-7d4803cb2d8c433d3ad077a7bf9e3074d91b3389
### What changes are proposed in this pull request?
add back netty dependency within grpc


### Why are the changes needed?
previously we exclude netty dependency since we already have netty-all outside Alluxio#9942

But due to grpc/grpc-java#11284, we sometimes have incompatibility between grpc and netty, and it's better to use shaded netty within grpc so we can be sure that they are compatible.

### Does this PR introduce any user facing changes?

na

			pr-link: Alluxio#18642
			change-id: cid-65d86f315e023592060b6a9f864bfe2a972dfe68
Install UCX on dockerfile jdk 8 for github PR test
			pr-link: Alluxio#18663
			change-id: cid-44eb2b7eb1c432ca4dd3b8050fa1dd908f0b124e
@alluxio-bot
Copy link
Contributor

Thank you for your pull request.
In order for us to evaluate and accept your PR, we ask that you sign a contribution license agreement (CLA).
It's all electronic and will take just a few minutes. Please download CLA form here, sign, and e-mail back to [email protected]

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

fix using unavailable worker cached in client-pool