Commits · scylla-5.2.4 · Masters-thesis / DB Sources / ScyllaDB

Jun 22, 2023
release: prepare for 5.2.4 · cebbf6c5

Anna Mikhlin authored 1 year ago

cebbf6c5

Jun 21, 2023

Update seastar submodule (default priority class shares) · 73b86699

Avi Kivity authored 1 year ago

* seastar 32ab15cda6...29a0e64513 (1):
  > reactor: change shares for default IO class from 1 to 200

Fixes #13753.

In 5.3: 37e6e652

73b86699

Jun 15, 2023

Merge 'Backport 5.2 test.py stability/UX improvements' from Kamil Braun · 9efca96c

Botond Dénes authored 1 year ago

Backport the following improvements for test.py topology tests for CI stability:
- https://github.com/scylladb/scylladb/pull/12652
- https://github.com/scylladb/scylladb/pull/12630
- https://github.com/scylladb/scylladb/pull/12619
- https://github.com/scylladb/scylladb/pull/12686
- picked from https://github.com/scylladb/scylladb/pull/12726: 9ceb6aba
- picked from https://github.com/scylladb/scylladb/pull/12173: fc604844
- https://github.com/scylladb/scylladb/pull/12765
- https://github.com/scylladb/scylladb/pull/12804
- https://github.com/scylladb/scylladb/pull/13342
- https://github.com/scylladb/scylladb/pull/13589
- picked from https://github.com/scylladb/scylladb/pull/13135: 7309a1bd
- picked from https://github.com/scylladb/scylladb/pull/13134: 21b505e6, a4411e9e, c1d0ee2b, 8e3392c6, 794d0e40, e407956e
- https://github.com/scylladb/scylladb/pull/13271
- https://github.com/scylladb/scylladb/pull/13399
- picked from https://github.com/scylladb/scylladb/pull/12699: 3508a4e4, 08d754e1, 62a945cc, 041ee3ff
- https://github.com/scylladb/scylladb/pull/13438 (but skipped the test_mutation_schema_change.py fix since I didn't backport this new test)
- https://github.com/scylladb/scylladb/pull/13427
- https://github.com/scylladb/scylladb/pull/13756
- https://github.com/scylladb/scylladb/pull/13789
- https://github.com/scylladb/scylladb/pull/13933 (but skipped the test_snapshot.py fix since I didn't backport this new test)

Closes #14215

* github.com:scylladb/scylladb:
  test: pylib: fix `read_barrier` implementation
  test: pylib: random_tables: perform read barrier in `verify_schema`
  test: issue a read barrier before checking ring consistency
  Merge 'scylla_cluster.py: fix read_last_line' from Gusev Petr
  test/pylib: ManagerClient helpers to wait for...
  test: pylib: Add a way to create cql connections with particular coordinators
  test/pylib: get gossiper alive endpoints
  test/topology: default replication factor 3
  test/pylib: configurable replication factor
  scylla_cluster.py: optimize node logs reading
  test/pylib: RandomTables.add_column with value column
  scylla_cluster.py: add start flag to server_add
  ServerInfo: drop host_id
  scylla_cluster.py: add config to server_add
  scylla_cluster.py: add expected_error to server_start
  scylla_cluster.py: ScyllaServer.start, refactor error reporting
  scylla_cluster.py: fix ScyllaServer.start, reset cmd if start failed
  test: improve logging in ScyllaCluster
  test: topology smp test with custom cluster
  test/pylib: topology: support clusters of initial size 0
  Merge 'test/pylib: split and refactor topology tests' from Alecco
  Merge 'test/pylib: use larger timeout for decommission/removenode' from Kamil Braun
  test: Increase START_TIMEOUT
  test/pylib: one-shot error injection helper
  test: topology: wait for token ring/group 0 consistency after decommission
  test: topology: verify that group 0 and token ring are consistent
  Merge 'pytest: start after ungraceful stop' from Alecco
  Merge 'test.py: improve test failure handling' from Kamil Braun

9efca96c

Jun 14, 2023

Backport 'Merge 'Enlighten messaging_service::shutdown()'' · 210e3d19

Pavel Emelyanov authored 1 year ago

This includes seastar update titled
  'Merge 'Split rpc::server stop into two parts''

* br-5.2-backport-ms-shutdown:
  messaging_service: Shutdown rpc server on shutdown
  messaging_service: Generalize stop_servers()
  messaging_service: Restore indentation after previous patch
  messaging_service: Coroutinize stop()
  messaging_service: Coroutinize stop_servers()
  Update seastar submodule

refs: #14031

210e3d19

messaging_service: Shutdown rpc server on shutdown · 702d622b

Pavel Emelyanov authored 1 year ago


The RPC server now has a lighter .shutdown() method that does just what
the messaging service's shutdown() needs, so call it. On stop, call the
regular stop() to finalize the stopping process.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

702d622b

messaging_service: Generalize stop_servers() · db446302

Pavel Emelyanov authored 1 year ago

Rename it to do_with_servers() and make it accept a method to call and a
message to print. This makes it possible to reuse the helper in the next patch.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

db446302

messaging_service: Restore indentation after previous patch · 5d3d64ba

Pavel Emelyanov authored 1 year ago

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

5d3d64ba

messaging_service: Coroutinize stop() · 079f5d8e

Pavel Emelyanov authored 1 year ago

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

079f5d8e

messaging_service: Coroutinize stop_servers() · fd7310b1

Pavel Emelyanov authored 1 year ago

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

fd7310b1

Update seastar submodule · 991d0096

Pavel Emelyanov authored 1 year ago


* seastar 8c86e6de...32ab15cd (1):
  > rpc: Introduce server::shutdown()

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

991d0096

Jun 13, 2023

doc: remove support for Ubuntu 18 · 0137ddae

Anna Stuchlik authored 1 year ago

Fixes https://github.com/scylladb/scylladb/issues/14097

This commit removes support for Ubuntu 18 from
platform support for ScyllaDB Enterprise 2023.1.

The update is in sync with the change made for
ScyllaDB 5.2.

This commit must be backported to branch-5.2 and
branch-5.3.

Closes #14118

(cherry picked from commit b7022cd7)

0137ddae

compaction: Fix incremental compaction for sstable cleanup · 58f88897

Raphael S. Carvalho authored 1 year ago


After c7826aa9, sstable runs are cleaned up together.

The procedure which executes cleanup was holding reference to all
input sstables, such that it could later retry the same cleanup
job on failure.

Turns out it was not taking into account that incremental compaction
will exhaust the input set incrementally.

Therefore cleanup is affected by the 100% space overhead.

To fix it, cleanup will now have the input set updated, by removing
the sstables that were already cleaned up. On failure, cleanup
will retry the same job with the remaining sstables that weren't
exhausted by incremental compaction.
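
The retry behaviour described above can be modelled roughly as follows. This is a toy Python sketch, not ScyllaDB's C++ code: `cleanup_with_retry`, `compact_one`, `max_attempts`, and the string sstable names are all hypothetical, and a `RuntimeError` stands in for whatever failure triggers a retry.

```python
from typing import Callable, List


def cleanup_with_retry(sstables: List[str],
                       compact_one: Callable[[str], None],
                       max_attempts: int = 3) -> List[str]:
    """Clean up sstables one at a time, dropping each from the input set
    as soon as it is exhausted, so a retry only sees what is left."""
    remaining = list(sstables)   # the job's input set (hypothetical model)
    cleaned: List[str] = []
    for _ in range(max_attempts):
        try:
            while remaining:
                compact_one(remaining[0])
                # Release the reference right away instead of holding
                # every input sstable until the whole job finishes.
                cleaned.append(remaining.pop(0))
            return cleaned
        except RuntimeError:
            continue  # retry with the sstables that are still pending
    raise RuntimeError(f"cleanup failed, still pending: {remaining}")
```

Because exhausted sstables leave `remaining` immediately, nothing in this model keeps them alive after their replacements are written, which is the property the fix is after.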

New unit test reproduces the failure, and passes with the fix.

Fixes #14035.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Closes #14038

(cherry picked from commit 23443e05)
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Closes #14193

58f88897

Jun 12, 2023

test: pylib: fix `read_barrier` implementation · f4115528

Kamil Braun authored 1 year ago

The previous implementation didn't actually do a read barrier, because
the statement failed on an early prepare/validate step which happened
before read barrier was even performed.

Change it to a statement which does not fail and doesn't perform any
schema change but requires a read barrier.

This breaks one test which uses `RandomTables.verify_schema()` when only
one node is alive, but `verify_schema` performs a read barrier. Unbreak
it by skipping the read barrier in this case (it makes sense in this
particular test).
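
A minimal sketch of a statement with the properties described above. The exact statement used by the commit is not shown here, so `DROP TABLE IF EXISTS` on a made-up keyspace and table is an assumption; `FakeSession` and its `run_async(statement, host=...)` method are hypothetical stand-ins for the test library's session object.

```python
import asyncio
from typing import Any, List, Tuple


class FakeSession:
    """Stand-in for a CQL session; records statements instead of sending them."""

    def __init__(self) -> None:
        self.sent: List[Tuple[str, Any]] = []

    async def run_async(self, statement: str, host: Any = None) -> None:
        self.sent.append((statement, host))


async def read_barrier(cql: Any, host: Any) -> None:
    # IF EXISTS keeps the statement from failing on the missing table,
    # and nothing is dropped, so the schema is left unchanged.
    await cql.run_async(
        "DROP TABLE IF EXISTS nosuchkeyspace.nosuchtable", host=host)


session = FakeSession()
asyncio.run(read_barrier(session, "127.0.0.1"))
```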

Closes #13933

(cherry picked from commit 64dc76db)
Backport note: skipped the test_snapshot.py change, as the test doesn't
exist on this branch.

f4115528

test: pylib: random_tables: perform read barrier in `verify_schema` · 9c941aba

Kamil Braun authored 1 year ago

`RandomTables.verify_schema` is often called in topology tests after
performing a schema change. It compares the schema tables fetched from
some node to the expected latest schema stored by the `RandomTables`
object.

However there's no guarantee that the latest schema change has already
propagated to the node which we query. We could have performed the
schema change on a different node and the change may not have been
applied yet on all nodes.

To fix that, pick a specific node and perform a read barrier on it, then
use that node to fetch the schema tables.

Fixes #13788

Closes #13789

(cherry picked from commit 3f3dcf45)

9c941aba

test: issue a read barrier before checking ring consistency · 094bcac3

Konstantin Osipov authored 1 year ago

Raft replication doesn't guarantee that all replicas see
identical Raft state at all times; it only guarantees the
same order of events on all replicas.

When comparing raft state with gossip state on a node, first
issue a read barrier to ensure the node has the latest raft state.

To issue a read barrier it is sufficient to alter a non-existing
state: in order to validate the DDL the node needs to sync with the
leader and fetch its latest group0 state.

Fixes #13518 (flaky topology test).

Closes #13756

(cherry picked from commit e7c9ca56)

094bcac3

Merge 'scylla_cluster.py: fix read_last_line' from Gusev Petr · e49a531a

Kamil Braun authored 1 year ago

This is a follow-up to #13399. The patch
addresses the issues mentioned there:
* linesep can be split between blocks;
* linesep can be part of a UTF-8 sequence;
* avoid excessively long lines, limit to 256 chars;
* make the logic of the function simpler and more maintainable.
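
One way to read the last line under these constraints is sketched below. This is not the code from the patch: the signature, the `block` and `sep` parameters, and the choice to drop bytes of a cut multi-byte character are assumptions made for illustration.

```python
import os
import tempfile
from pathlib import Path


def read_last_line(path: Path, max_chars: int = 256,
                   block: int = 64, sep: bytes = b"\n") -> str:
    """Return the last non-empty line of a file without reading all of it."""
    max_bytes = max_chars * 4          # a UTF-8 code point is at most 4 bytes
    buf = b""
    with path.open("rb") as f:
        pos = f.seek(0, os.SEEK_END)
        while pos > 0 and len(buf) <= max_bytes:
            step = min(block, pos)
            pos -= step
            f.seek(pos)
            # Prepend and search the whole buffer, not just the new block,
            # so a multi-byte separator split across blocks is still found.
            buf = f.read(step) + buf
            if sep in buf.rstrip(sep):
                break
    tail = buf.rstrip(sep).rsplit(sep, 1)[-1]
    # A block boundary may cut a multi-byte character; drop the partial bytes.
    text = tail.decode("utf-8", errors="ignore")
    return text[-max_chars:]


with tempfile.TemporaryDirectory() as tmp:
    log = Path(tmp) / "node.log"
    log.write_bytes("starting\ninit done żółć\n".encode("utf-8"))
    last = read_last_line(log, block=4)
```

The small `block=4` in the usage lines forces several backward reads, including ones that split two-byte characters, to exercise the boundary handling.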

Closes #13427

* github.com:scylladb/scylladb:
  pylib_test: add tests for read_last_line
  pytest: add pylib_test directory
  scylla_cluster.py: fix read_last_line
  scylla_cluster.py: move read_last_line to util.py

(cherry picked from commit 70f2b093)

e49a531a

test/pylib: ManagerClient helpers to wait for... · bcf99a37

Alejo Sanchez authored 1 year ago


server to see other servers after start/restart

When starting/restarting a server, provide a way to wait for the server
to see at least n other servers.

Also leave the implementation methods available for manual use and
update previous tests, one to wait for a specific server to be seen, and
one to wait for a specific server to not be seen (down).
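
A waiting helper of this kind is typically a polling loop with a deadline. The sketch below illustrates the idea only; `wait_for`, `sees_two_others`, and the simulated server names are hypothetical and do not reflect the actual ManagerClient API.

```python
import asyncio
import time
from typing import Awaitable, Callable, Optional, TypeVar

T = TypeVar("T")


async def wait_for(pred: Callable[[], Awaitable[Optional[T]]],
                   deadline: float, period: float = 0.01) -> T:
    """Poll `pred` until it returns something other than None or time runs out."""
    while True:
        if time.monotonic() > deadline:
            raise TimeoutError("condition not met before deadline")
        result = await pred()
        if result is not None:
            return result
        await asyncio.sleep(period)


async def demo() -> list:
    seen: list = []                       # servers the node currently sees

    async def sees_two_others() -> Optional[list]:
        seen.append(f"server-{len(seen) + 1}")   # pretend gossip progresses
        return list(seen) if len(seen) >= 2 else None

    return await wait_for(sees_two_others, time.monotonic() + 5)


result = asyncio.run(demo())
```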

Fixes #13147

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>

Closes #13438

(cherry picked from commit 11561a73)
Backport note: skipped the test_mutation_schema_change.py fix as the
test doesn't exist on this branch.

bcf99a37

test: pylib: Add a way to create cql connections with particular coordinators · fe4af957

Tomasz Grabiec authored 1 year ago

Usage:

  await manager.driver_connect(server=servers[0])
  manager.cql.execute(f"...", execution_profile='whitelist')

(cherry picked from commit 041ee3ff)

fe4af957

test/pylib: get gossiper alive endpoints · ac5dff7d

Alejo Sanchez authored 2 years ago


Helper to get list of gossiper alive endpoints from REST API.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
(cherry picked from commit 62a945cc)

ac5dff7d

test/topology: default replication factor 3 · ad99456a

Alejo Sanchez authored 2 years ago

Most tests will have nodes down, so increase the replication factor to
3 to avoid problems with partitions belonging to down nodes.

Use replication factor 1 for raft upgrade tests.

(cherry picked from commit 08d754e1)

ad99456a

test/pylib: configurable replication factor · 937e890f

Alejo Sanchez authored 2 years ago


Make replication factor configurable for the RandomTables helper.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
(cherry picked from commit 3508a4e4)

937e890f

scylla_cluster.py: optimize node logs reading · 12eec5bb

Petr Gusev authored 1 year ago

There are two occasions in scylla_cluster
where we read the node logs, and in both of
them we read the entire file in memory.
This is not efficient and may cause an OOM.

In the first case we need the last line of the
log file, so we seek to the end and move backwards
looking for a newline character.

In the second case we look through the
log file to find the expected_error.
The readlines() method returns a Python
list object, which means it reads the entire
file in memory. It's sufficient to just remove
it since iterating over the file instance
already yields lines lazily one by one.
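
The difference in the second case can be illustrated with a short sketch. `log_contains` is a hypothetical helper written for this example, not a function from scylla_cluster.py.

```python
import tempfile
from pathlib import Path


def log_contains(path: Path, expected_error: str) -> bool:
    """Scan a log for a substring one line at a time."""
    with path.open("r", encoding="utf-8", errors="replace") as f:
        # Iterating the file object yields lines lazily; f.readlines()
        # would first build a list holding every line of the file.
        for line in f:
            if expected_error in line:
                return True
    return False


with tempfile.TemporaryDirectory() as tmp:
    log = Path(tmp) / "node.log"
    log.write_text("INFO starting\nERROR bad config: no seeds\n", encoding="utf-8")
    found = log_contains(log, "bad config")
    missing = log_contains(log, "out of memory")
```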

This is a follow-up for #13134.

Closes #13399

(cherry picked from commit 09636b20)

12eec5bb

test/pylib: RandomTables.add_column with value column · 59847389

Alejo Sanchez authored 1 year ago


When adding extra columns in a test, make them value columns. Name them
with the "v_" prefix and use the value column number counter.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>

Closes #13271

(cherry picked from commit 81b40c10)

59847389

scylla_cluster.py: add start flag to server_add · 7a8c5db5

Petr Gusev authored 1 year ago

Sometimes when creating a node it's useful
to just install it and not start it. For example,
we may want to try to start it later with
an expected error.

The ScyllaServer.install method has been made
exception safe: if an exception occurs, it
reverts to the original state. This avoids
duplicating the try/except logic
in two of its call sites.
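
The exception-safety pattern can be sketched as below. This is a toy model: the `Server` class, its `workdir` and `config` fields, the `write_config` callback, and the placeholder path are hypothetical and are not taken from ScyllaServer.

```python
from typing import Callable, Optional


class Server:
    """Toy model of a server object; field names are hypothetical."""

    def __init__(self) -> None:
        self.workdir: Optional[str] = None
        self.config: Optional[dict] = None

    def install(self, write_config: Callable[[], dict]) -> None:
        saved = (self.workdir, self.config)
        try:
            self.workdir = "/tmp/scylla-1"      # placeholder path
            self.config = write_config()
        except Exception:
            # Roll back so callers never observe a half-installed server.
            self.workdir, self.config = saved
            raise


def broken() -> dict:
    raise OSError("disk full")


srv = Server()
try:
    srv.install(broken)
except OSError:
    pass  # srv is back in its pre-install state here
```

Since the rollback lives inside `install`, each call site only has to decide what to do with the exception, not how to undo a partial install.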

(cherry picked from commit e407956e)

7a8c5db5

ServerInfo: drop host_id · 15ea5bf5

Petr Gusev authored 1 year ago

We are going to allow the
ScyllaCluster.add_server function not to
start the server if the caller has requested
that with a special parameter. The host_id
can only be obtained from a running node, so
add_server won't be able to return it in
this case. I've grepped the tests for host_id
and there doesn't seem to be any
reference to it in the code.

(cherry picked from commit 794d0e40)

15ea5bf5

scylla_cluster.py: add config to server_add · 3ab61075

Petr Gusev authored 1 year ago

Sometimes when creating a node it's useful
to pass a custom node config.

(cherry picked from commit 8e3392c6)

3ab61075

scylla_cluster.py: add expected_error to server_start · 1959eddf

Petr Gusev authored 1 year ago

Sometimes it's useful to check that the node has failed
to start for a particular reason. If server_start can't
find expected_error in the node's log or if the
node has started without errors, it throws an exception.

(cherry picked from commit c1d0ee2b)

1959eddf

scylla_cluster.py: ScyllaServer.start, refactor error reporting · 43525aec

Petr Gusev authored 1 year ago

Extract a function that encapsulates all the error
reporting logic. We are going to use it in several
other places to implement the expected_error feature.

(cherry picked from commit a4411e9e)

43525aec

scylla_cluster.py: fix ScyllaServer.start, reset cmd if start failed · 930c4e65

Petr Gusev authored 1 year ago

The ScyllaServer expects cmd to be None if the
Scylla process is not running. Otherwise, if start failed
and the test called update_config, the latter will
try to send a signal to a non-existent process via cmd.

(cherry picked from commit 21b505e6)

930c4e65

test: improve logging in ScyllaCluster · d2caaef1

Konstantin Osipov authored 2 years ago

Print IP addresses and cluster identifiers in more log messages
to help with debugging.

(cherry picked from commit 7309a1bd)

d2caaef1

test: topology smp test with custom cluster · 6474edd6

Alejo Sanchez authored 1 year ago


Instead of decommissioning the initial cluster, use a custom cluster.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>

Closes #13589

(cherry picked from commit ce87aedd)

6474edd6

test/pylib: topology: support clusters of initial size 0 · b39cdadf

Alejo Sanchez authored 1 year ago


To allow tests with custom clusters, allow configuration of initial
cluster size of 0.

Add a proof-of-concept test to be removed later.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>

Closes #13342

(cherry picked from commit e3b46250)

b39cdadf

Merge 'test/pylib: split and refactor topology tests' from Alecco · 7b60cdda

Nadav Har'El authored 2 years ago

Move long-running topology tests out of `test_topology.py` and into their own files, so they can be run in parallel.

While there, merge simple schema tests.

Closes #12804

* github.com:scylladb/scylladb:
  test/topology: rename topology test file
  test/topology: lint and type for topology tests
  test/topology: move topology ip tests to own file
  test/topology: move topology test remove garbage...
  test/topology: move topology rejoin test to own file
  test/topology: merge topology schema tests and...
  test/topology: isolate topology smp params test
  test/topology: move topology helpers to common file

(cherry picked from commit a24600a6)

7b60cdda

Merge 'test/pylib: use larger timeout for decommission/removenode' from Kamil Braun · ea80fe20

Botond Dénes authored 2 years ago

Recently we enabled RBNO by default in all topology operations. This
made the operations a bit slower (repair-based topology ops are a bit
slower than classic streaming because they do more work), and in debug mode
with a large number of concurrent tests running, they might time out.

The timeout for bootstrap was already increased before, do the same for
decommission/removenode. The previously used timeout was 300 seconds
(this is the default used by aiohttp library when it makes HTTP
requests), now use the TOPOLOGY_TIMEOUT constant from ScyllaServer which
is 1000 seconds.

Closes #12765

* github.com:scylladb/scylladb:
  test/pylib: use larger timeout for decommission/removenode
  test/pylib: scylla_cluster: rename START_TIMEOUT to TOPOLOGY_TIMEOUT

(cherry picked from commit e55f475d)

ea80fe20

test: Increase START_TIMEOUT · f90fe6f3

Asias He authored 2 years ago

The CI machine has been observed to run the test slowly. Increase the
timeout for adding servers.

(cherry picked from commit fc604844)

f90fe6f3

test/pylib: one-shot error injection helper · 6e2c5473

Alejo Sanchez authored 2 years ago


Existing helper with async context manager only worked for non one-shot
error injections. Fix it and add another helper for one-shot without a
context manager.
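
The two helper shapes can be contrasted with a small sketch. Everything here is hypothetical: `FakeInjectionApi` is an in-memory stand-in for a node's error injection API, and `inject_error` and `inject_error_one_shot` are illustrative names rather than the helpers from the patch.

```python
import asyncio
from contextlib import asynccontextmanager
from typing import AsyncIterator, Dict


class FakeInjectionApi:
    """In-memory stand-in for a node's error injection REST API."""

    def __init__(self) -> None:
        self.enabled: Dict[str, bool] = {}      # name -> one_shot flag

    async def enable(self, name: str, one_shot: bool) -> None:
        self.enabled[name] = one_shot

    async def disable(self, name: str) -> None:
        self.enabled.pop(name, None)


@asynccontextmanager
async def inject_error(api: FakeInjectionApi, name: str) -> AsyncIterator[None]:
    """Keep the injection active for the duration of the `async with` body."""
    await api.enable(name, one_shot=False)
    try:
        yield
    finally:
        await api.disable(name)


async def inject_error_one_shot(api: FakeInjectionApi, name: str) -> None:
    """Arm an injection that is expected to fire once; no cleanup scope needed."""
    await api.enable(name, one_shot=True)


async def demo() -> list:
    api = FakeInjectionApi()
    states = []
    async with inject_error(api, "slow_write"):
        states.append(dict(api.enabled))
    states.append(dict(api.enabled))
    await inject_error_one_shot(api, "fail_once")
    states.append(dict(api.enabled))
    return states


states = asyncio.run(demo())
```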

Fix tests using the previous helper.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
(cherry picked from commit 9ceb6aba)

6e2c5473

test: topology: wait for token ring/group 0 consistency after decommission · 91aa2cd8

Kamil Braun authored 2 years ago

There was a check for immediate consistency after a decommission
operation has finished in one of the tests, but it turns out that also
after decommission it might take some time for token ring to be updated
on other nodes. Replace the check with a wait.

Also do the wait in another test that performs a sequence of
decommissions. We won't attempt to start another decommission until
every node learns that the previously decommissioned node has left.

Closes #12686

(cherry picked from commit 40142a51)

91aa2cd8

test: topology: verify that group 0 and token ring are consistent · 05c3f7ec

Kamil Braun authored 2 years ago

After topology changes like removing a node, verify that the set of
group 0 members and token ring members is the same.

Modify `get_token_ring_host_ids` to only return NORMAL members. The
previous version which used the `/storage_service/host_id` endpoint
might have returned non-NORMAL members as well.
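
The check amounts to filtering the token ring down to NORMAL members and comparing two sets. In the sketch below the input shape (a mapping from host ID to status string) and the function names are assumptions for illustration; the real helpers query a node's REST API.

```python
from typing import Dict, Set


def token_ring_host_ids(status_by_host: Dict[str, str]) -> Set[str]:
    """Keep only hosts whose status is NORMAL."""
    return {host for host, status in status_by_host.items() if status == "NORMAL"}


def check_consistent(group0: Set[str], status_by_host: Dict[str, str]) -> None:
    """Raise if group 0 membership and the NORMAL token ring members differ."""
    ring = token_ring_host_ids(status_by_host)
    if group0 != ring:
        raise AssertionError(
            f"only in group 0: {sorted(group0 - ring)}, "
            f"only in token ring: {sorted(ring - group0)}")


statuses = {"host-a": "NORMAL", "host-b": "NORMAL", "host-c": "LEAVING"}
check_consistent({"host-a", "host-b"}, statuses)   # passes: host-c is filtered out
```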

Fixes: #12153

Closes #12619

(cherry picked from commit fa9cf81a)

05c3f7ec

Merge 'pytest: start after ungraceful stop' from Alecco · 3aa73e8b

Kamil Braun authored 2 years ago

If a server is stopped suddenly (i.e. not gracefully), schema tables might
be in an inconsistent state. Add a test case and enable the Scylla
configuration option (force_schema_commit_log) to handle this.

Fixes #12218

Closes #12630

* github.com:scylladb/scylladb:
  pytest: test start after ungraceful stop
  test.py: enable force_schema_commit_log

(cherry picked from commit 5eadea30)

3aa73e8b

Merge 'test.py: improve test failure handling' from Kamil Braun · a0ba3b33

Nadav Har'El authored 2 years ago

Improve logging by printing the cluster at the end of each test.

Stop performing operations like attempting queries or dropping keyspaces on dirty clusters. Dirty clusters might be completely dead and these operations would only cause more "errors" to happen after a failed test, making it harder to find the real cause of failure.

Mark cluster as dirty when a test that uses it fails - after a failed test, we shouldn't assume that the cluster is in a usable state, so we shouldn't reuse it for another test.

Rely on the `is_dirty` flag in `PythonTest`s and `CQLApprovalTest`s, similarly to what `TopologyTest`s do.
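
The recycling rule can be illustrated with a toy pool. Only the `is_dirty` flag name comes from the description above; `Cluster`, `ClusterPool`, and `run_test` are hypothetical and much simpler than the classes in test.py.

```python
from typing import Callable, List


class Cluster:
    def __init__(self, name: str) -> None:
        self.name = name
        self.is_dirty = False


class ClusterPool:
    """Hands out clusters and recycles only the clean ones."""

    def __init__(self) -> None:
        self.free: List[Cluster] = []
        self.created = 0

    def get(self) -> Cluster:
        if self.free:
            return self.free.pop()
        self.created += 1
        return Cluster(f"cluster-{self.created}")

    def put(self, cluster: Cluster) -> None:
        if not cluster.is_dirty:
            self.free.append(cluster)   # dirty clusters are dropped instead


def run_test(pool: ClusterPool, test: Callable[[Cluster], None]) -> str:
    cluster = pool.get()
    try:
        test(cluster)
    except Exception:
        # After a failure the cluster state is unknown: mark it, skip cleanup.
        cluster.is_dirty = True
    finally:
        pool.put(cluster)
    return cluster.name


def passing(cluster: Cluster) -> None:
    pass


def failing(cluster: Cluster) -> None:
    raise RuntimeError("node crashed")


pool = ClusterPool()
names = [run_test(pool, passing), run_test(pool, failing), run_test(pool, passing)]
```

The third test gets a fresh cluster because the second one left its cluster dirty, so it was never returned to the free list.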

Closes #12652

* github.com:scylladb/scylladb:
  test.py: rely on ScyllaCluster.is_dirty flag for recycling clusters
  test/topology: don't drop random_tables keyspace after a failed test
  test/pylib: mark cluster as dirty after a failed test
  test: pylib, topology: don't perform operations after test on a dirty cluster
  test/pylib: print cluster at the end of test

(cherry picked from commit 2653865b)

a0ba3b33