This change introduces the configuration option 'full_backend_cache'
which changes the cache to be a full mirror of the backend instead
of a per-object cache. This allows all sorcery retrieval operations
to be carried out against it and is useful for object types which
are used in a "retrieve all" or "retrieve some" pattern.
ASTERISK-25625 #close
Change-Id: Ie2993487e9c19de563413ad5561c7403b48caab5
The JSON library Asterisk uses, jansson, is not thread
safe for us in a few ways. To help with this wrappers for JSON
object reference count increasing and decreasing were added
which use a global lock to ensure they don't clobber over
each other. This does not extend to reference count manipulation
within the jansson library itself. This means you can't safely
use the object borrowing specifier (O) in ast_json_pack and
you can't share JSON instances between objects.
This change removes uses of the O specifier and replaces them
with the o specifier and an explicit ast_json_ref. Some cases
of instance sharing have also been removed.
ASTERISK-25601 #close
Change-Id: I06550d8b0cc1bfeb56cab580a4e608ae4f1ec7d1
- Trigger pending DTLS packets to send out, once the RTP instance's remote
address is set.
- Avoids locking the DTLS structure unnecessarily by only doing this if
DTLS is passive.
- Add DTLS locks around the structurally sensitive calls in the SSL
portion of __rtp_recvfrom, since dtls_srtp_check_pending does not lock
inside of itself, and we're dealing with the SSL BIO in at least two
threads.
WebRTC channels may receive a DTLS handshake before
ast_rtp_remote_address_set is called, which causes there to be a pending
response to send out. Previous to 1ad827, this was handled by calling
dtls_srtp_check_pending on receipt of any RTP packet - a STUN or RTP
packet could trigger the pending handshake response. Since that was
rightfully removed, whenever the DTLS handshake is received before the
remote address is set, we would have to wait until another SSL packet
arrives.
As of Chrome M47's optimizations to their handshake process, WebRTC
conversations between Chrome M47+ and Asterisk, where Asterisk is passive,
experience a 1 second delay without this patch, because the SSL handshake
is received before ICE negotation stores the remote_address, and the next
SSL packet isn't received until after a 1 second timeout in Chrome, which
causes a new handshake request.
ASTERISK-25614 #close
Change-Id: I547f1be7e302dbf71f6553dd8cbc0657b1d0b908
pjproject < 2.5.0 will segfault on a tls transport if async_operations
is greater than 1. A runtime version check has been added to throw
an error if the version is < 2.5.0 and async_operations > 1.
To assist in the check, a new api "ast_compare_versions" was added
to utils which compares 2 major.minor.patch.extra version strings.
ASTERISK-25615 #close
Change-Id: I8e88bb49cbcfbca88d9de705496d6f6a8c938a98
Reported-by: George Joseph
Tested-by: George Joseph
Fixed a bug that originally would show a negative number of
active calls occuring in Asterisk. A gauge is persistent so
incrementing and decrementing it results in a more consistent
performance. Also changed to the call to StatsD to use
ast_statsd_log_string() so that a "+" could be sent to StatsD.
ASTERISK-25619 #close
Change-Id: Iaaeff5c4c6a46535366b4d16ea0ed0ee75ab2ee7
Both transport and endpoint now check for the existence and readability
of tls certificate and key files before passing them on to pjproject.
This will cause the object to not load rather than waiting for pjproject
to discover that there's a problem when a session is attempted.
NOTE: chan_sip also uses ast_rtp_dtls_cfg_parse but it's located
in build_peer which is gigantic and I didn't want to disturb it.
Error messages will emit but it won't interrupt chan_sip loading.
ASTERISK-25618 #close
Change-Id: Ie43f2c1d653ac1fda6a6f6faecb7c2ebadaf47c9
Reported-by: George Joseph
Tested-by: George Joseph
See ASTERISK-25615.
If the transport protocol is tls and async_operations > 1, pjproject
will segfault if more than one operation is attempted on the same socket.
Until this is fixed upstream, a check has been added to throw an error
if a tls transport config has async_operations set to > 1.
ASTERISK-25615
Change-Id: I76b9a5b2a5a0054fe71ca5851e635f2dca7685a6
Reported-by: George Joseph
Tested-by: George Joseph
It will never be perfect or even pretty, mostly because of the differences
between static and dynamic contacts.
Created:
Can't use the contact or contact_status alloc functions
because the objects come and go regardless of the actual state.
Can't use the contact_apply_handler, ast_sip_location_add_contact or
a sorcery created handler because they only get called for dynamic
contacts. Similarly, permanent_uri_handler only gets called for
static contacts.
So, Matt had it right. :) ast_res_pjsip_find_or_create_contact_status is
the only place it can go and not have duplicated code. Both
permanent_uri_handler and contact_apply_handler call find_or_create.
Removed:
Can't use the destructors for the same reason as above. The only
place to put this is in persistent_endpoint_contact_deleted_observer
which I believe is the "correct" place but even that will handle only
dynamic contacts. This doesn't called on shutdown however. There is
no hook to use for static contacts that may be removed because of a
config change while asterisk is in operation.
I moved the cleanup of contact_status from ast_sip_location_delete_contact
to the handler as well.
Status Change and RTT:
Although they worked fine where they were (in update_contact_status) I
moved them to persistent_endpoint_contact_status_observer to make it
more consistent with removed. There was logic there already to detect
a state change.
Finally, fixed a nit in permanent_uri_handler rmudgett reported
eralier.
ASTERISK-25608 #close
Change-Id: I4b56e7dfc3be3baaaf6f1eac5b2068a0b79e357d
Reported-by: George Joseph
Tested-by: George Joseph
Beside that, the format-attribute module sends only non-default values in the
line fmtp, now. This avoids unnecessary overhead in SDP messages. Furthermore,
previously the parameter stereo was not parsed when being the first parameter.
ASTERISK-25583 #close
Change-Id: Iae85ba3e5960bfd5d51cf65bcffad00dd4875a73
When 90d9a70789 was merged, it mostly tested dynamic contacts created as
a result of registering a PJSIP endpoint. Contacts generated in this
fashion typically have a long alphanumeric string as their object identifier,
which maps reasonably well for StatsD. Unfortunately, this doesn't work in the
general case. StatsD treats both '.' and ':' characters as special characters.
In particular, having a ':' appear in the middle of a StatsD metric will
result in the metric being rejected.
This causes some obvious issues with SIP URIs.
The StatsD API should not be responsible for escaping the metric name passed
to it. The metric is treated as a single long string, and it would be
challenging to know what to escape in the string passed to the function.
Likewise, we don't want to escape the metric in PJSIP, as that involves
overhead that is wasted when either res_statsd isn't loaded or enabled.
This patch takes an alternative approach. The Contact ID has been changed
to be "aor@@uri_hash" instead of "aor@@uri". This (a) won't contain any of the
aforementioned special characters, (b) can be done on Contact creation,
which has minimal impact on run-time performance, and (c) also conforms to an
earlier commit that changed the ID for dynamic contacts.
The downside of this is that StatsD users will have to map SHA1 hashes back to
the Contacts that are emitting the statistics. To that end, the CLI commands
have been updated to include the first 10 characters of the MD5 hash, which
should be enough to match what is shown in Graphite (or some other StatsD
backend).
ASTERISK-25595 #close
Change-Id: Ic674a3307280365b4a45864a3571c295b48a01e2
Reported-by: Matt Jordan
Tested-by: George Joseph
An earlier commit changed the id of dynamic contacts to contain
a hash instead of the uri. This patch updates status change
logging to show the aor/uri instead of the id. This required
adding the aor id to contact and contact_status and adding
uri to contact_status. The aor id gets added to contact and
contact_status in their allocators and the uri gets added to
contact_status in pjsip_options when the contact_status is
created or updated.
ASTERISK-25598 #close
Reported-by: George Joseph
Tested-by: George Joseph
Change-Id: I56cbec1d2ddbe8461367dd8b6da8a6f47f6fe511
The fastagi record-file testsuite test sometimes fails reporting an empty
recorded file. This was happening because Asterisk was sending the agi result
notification prior to actually closing the file and the data, being buffered,
had not been written to the file yet when the test attempts to check the file
size.
This patch makes it so the record file stream is closed prior to sending the
agi result notification.
ASTERISK-25593 #close
Change-Id: I6b2b3be3ae37f7c7b18e672c419a89b3b8513cde
The usage info for 'pjsip send notify' previously referenced the
chan_sip configuration sip_notify.conf. Fix this to reference
the correct configuration pjsip_notify.conf.
ASTERISK-25590 #close
Change-Id: I3898271a8e8a8b1db201741e790ebe2c6bf5cdea
If the sorcery object type is not found a NULL is returned.
Unfortunately, sorcery_realtime_filter_objectset() will crash after
complaining about not finding the object type and saying to expect errors.
* Use ao2_cleanup() instead of ao2_ref() to prevent the crash.
ASTERISK-25165
Reported by Corey Farrell
Change-Id: Ic3b64453ea3058cb68d5c26d97d4fe7b8eea2e97
When a channel is in a direct media bridge, a re-INVITE may arrive that forces
Asterisk to re-negotiate the media to a T.38 fax. When this occurs, the bridge
must change its technology to a simple bridge, and re-INVITE the media back
to Asterisk.
Generally, this logic mostly already exists in Asterisk. However, prior to this
patch, there were a few bugs:
(1) The T.38 framehook currently prevents a channel capable of T.38 faxes from
ever entering into a direct media bridge. This applies even when the only
media being passed over the channel is audio. This patch fixes this bug
by having the framehook specify that it defers caring about any frame type.
This allows the channels to enter into a direct media bridge, which will
be broken when a re-INVITE is received.
(2) When a re-INVITE is received, nothing instructed the bridging layer to
re-inspect the allowed bridging technology. This now occurs when either
a re-INVITE is received from a peer, or when a response is received from
the far end (that is, when the T.38 state changes to either
T38_PEER_REINVITE or T38_LOCAL_REINVITE).
(3) chan_pjsip needs to do a small amount of work to prevent a direct media
bridge from being chosen when a T.38 session is in progress. When a T.38
session supplement has a t38 datastore - which is added when we detect
we should start thinking about T.38 on a channel - we now refuse a native
RTP bridge.
(4) When a BYE request is received, we don't terminate the T.38 session. If
the other side of a T.38 fax survives the hangup (due to the 'g' flag
in Dial, for example), we don't currently re-INVITE the media on the
other channel back to audio. This patch now has res_pjsip_t38 intercept
BYE requests and inform the far side that the T.38 session is terminated.
This naturally causes the correct re-INVITEs to be sent.
ASTERISK-25582
Change-Id: Iabd6aa578e633d16e6b9f342091264e4324a79eb
This patch adds some debug statements to res_pjsip_t38. These statements help
to determine which SDP negotiation callbacks are being executed, and, when
a particular callback exits, why a callback may not have applied its logic
to the local or remote SDP.
Change-Id: I61b3fb9183b7ebbb5da8e9f48b59a5d9d7042d77
This patch adds a module that emits StatsD statistics about Asterisk
endpoints. This includes:
* A GUAGE statistic for endpoint states, tracking how many endpoints are in
a particular state.
* A GUAGE statistic for each endpoint, counting the number of channels
currently associated with an endpoint.
ASTERISK-25572
Change-Id: If7e1333c5aeda8d136850b30c2101c0ee1c97305
This patch adds the ability to send StatsD statistics related to the
state of PJSIP contacts. This includes:
* A GUAGE statistic measuring the count of contacts in a particular state.
This measures how many contacts are reachable, unreachable, etc.
* The RTT time for each contact, if those contacts are qualified. This
provides StatsD engines useful time-based data about each contact.
ASTERISK-25571
Change-Id: Ib8378d73afedfc622be0643b87c542557e0b332c
This patch adds outbound registration statistics for StatsD. This includes
the following:
* A GUAGE metric for the overall count of outbound registrations.
* A GUAGE metric for each state an outbound registration can be in. As the
outbound registrations change state, the overall count of how many
outbound registrations are in the particular state is changed.
These statistics are particularly useful for systems with a large number of
SIP trunks, and where measuring the change in state of the trunks is useful
for monitoring.
ASTERISK-25571
Change-Id: Iba6ff248f5d1c1e01acbb63e9f0da1901692eb37
When Asterisk is configured to use a dynamic sorcery backend (such as
res_sorcery_astdb) with 'registration' objects, it will fail to create the
internal state objects associated with the registration objects on module
load. This is due to nothing actually querying for the specific objects
and calling their sorcery apply handler during module load.
This patch fixes that by calling get_registrations in the sorcery observer's
object_type_loaded handler. Doing this causes the sorcery backends to be
asked for the current state of all registration objects, which causes the
apply handler to be called and the internal run-time state to be created.
ASTERISK-25575 #close
Change-Id: Ie9306e797098c6d4da7bcf4a5434a15891508b23
When no parameter is present, Asterisk does not generate the line fmtp, as
expected. However, because a buffer was reset, even rtpmap and fmtp of previous
media codecs got removed. Now, Asterisk does not reset other codecs in case of
no parameter for H.264.
ASTERISK-25573 #close
Change-Id: I93811331f4a28c45418a9e14ee46c0debd47a286
Often, the metric names of statistics we are generating for StatsD have some
dynamic component to them. This can be the name of a particular resource, or
some internal status label in Asterisk. With the current set of functions,
callers of the statsd API must first build the metric name themselves, then
pass this to the API functions. This results in a large amount of boilerplate
code and usage of either fixed length static buffers or dynamic memory
allocation, neither of which is desireable.
This patch adds two new functions to the StatsD API that support a printf
style format specifier for constructing the metric name. A dynamic string,
allocated in threadstorage, is used to build the metric name. This eases
the burden on users of the StatsD API.
Change-Id: If533c72d1afa26d807508ea48b4d8c7b32f414ea
Receiving a 423 Interval Too Brief response after authentication for an
outbound registration attempt results in assuming that the registrar has
rejected the registration permanently. If there are no configured retries
for fatal responses then the outbound registration is stopped for that
endpoint.
For registrations, PJSIP/PJPROJECT intercepts the handling of 423
responses and does not include any authentication in the updated
registration request. When the updated request is challenged then the
Asterisk code assumes that we were challenged again because the peer
rejected the authentication we sent earlier.
* Made registration challenges keep track of the CSeq number to determine
if the received challenge response was for the request we thought we sent.
If the response's CSeq number differs from the CSeq number we last sent
with authentication then authenticate again because it is a challenge to a
different request.
Change-Id: I81b4bd36d1be095bab606e34b8b44e6302971b09
Added a new api to res_statsd.c to allow it to receive a
character pointer for the value argument. This allows for a
'+' and a '-' to easily be sent with the value.
ASTERISK-25419
Reported By: Ashley Sanders
Change-Id: Id6bb53600943d27347d2bcae26c0bd5643567611
When a request is sent using pjsip_endpt_send_request and fails, a condition
exists where the request wrapper, which is an AO2 object, may be de-ref'd
more times than it should. This occurs when the request's callback is called,
and, in the callback, the timer on the PJSIP heap is cancelled. When that
occurs, the request wrapper's lifetime is decremented. When
pjsip_endpt_send_request fails, we unilaterally decrement the lifetime of
the request wrapper again, even though we've already cancelled the reference
associated with the timer.
This patch checks the return result of pj_timer_heap_cancel_if_active before
removing the reference associated with the timer. We now only decrement it
in this case if a timer is cancelled as a result of the function call.
Change-Id: I21332343a1a019c1117076f9bf2df27be2850102
If an authenticated incoming caller does not respond to our 200 OK INVITE
response with an ACK then PJSIP will hangup the call. Unfortunately,
there is a chance that the session's channel will go away between one use
of the channel pointer and another when building the BYE request because
the BYE is being built by the monitor thread and not the call's serializer
thread.
* Added a check to ensure that the thread trying to add the Reason header
is the call's serializer thread. This ensures that the channel will not
go away on us.
Change-Id: I866388d2b97ea2032eaae3f3ab3f1ca6cbd2df89
In practical tests, we have seen certain taskprocessors, specifically
Stasis subscription taskprocessors, cross the recently-added high-water
mark and emit a warning. This high-water mark warning is only intended
to be emitted when things have tanked on the system and things are
heading south quickly. In the practical tests, the Stasis taskprocessors
sometimes had a max depth of 180 tasks in them, and Asterisk wasn't in
any danger at all.
As such, this ups the high-water mark to 500 tasks instead. It also
redefines the SIP threadpool request denial number to be a multiple of
the taskprocessor high-water mark.
Change-Id: Ic8d3e9497452fecd768ac427bb6f58aa616eebce
When the SIP threadpool is backed up with tasks, we send 503 responses
to ensure that we don't try to overload ourselves. The problem is that
we were not insuring that we were not trying to send a 503 to an
incoming SIP response.
This change makes it so that we only send the 503 on incoming requests.
Change-Id: Ie2b418d89c0e453cc6c2b5c7d543651c981e1404
We have observed situations where the SIP threadpool may become
deadlocked. However, because incoming traffic is still arriving, the SIP
threadpool's queue can continue to grow, eventually running the system
out of memory.
This change makes it so that incoming traffic gets rejected with a 503
response if the queue is backed up too much.
Change-Id: I4e736d48a2ba79fd1f8056c0dcd330e38e6a3816
When ASTERISK-25449 was closed, a number of scheduler issues mentioned in
the comments were missed. These have since beed raised in ASTERISK-25476
and elsewhere.
This patch attempts to collect all of the scheduler issues discovered so
far and address them sensibly.
ASTERISK-25476 #close
Change-Id: I87a77d581e2e0d91d33b4b2fbff80f64a566d05b
In SIP/SDP, Opus has two channels always (see RFC 7587 section 7). The actual
amount of channels is negotiated in-band. Therefore now, the Opus codec and its
attribute rtpmap are registered with two channels.
ASTERISK-24779 #close
Reported by: PowerPBX
Tested by: Alexander Traud
patches:
asterisk-24779.patch submitted by Sean Bright (license #5060)
Change-Id: Ic7ac13cafa1d3450b4fa4987350924b42cbb657b
The contact_status Sorcery objects are currently not destroyed when a contact
is deleted. This causes the contact's last known RTT/status to be 'sticky'
when the contact itself may no longer exist. This patch causes the
contact_status objects associated with both dynamic and static contacts to
be destroyed if the AoR holding those contacts is also destroyed (or via
other paths where a contact may be deleted.)
Change-Id: I7feec8b9278cac3c5263a4c0483f4a0f3b62426e
When an endpoint is deleted (such as through an API), the persistent endpoint
currently continues to lurk around. While this isn't harmful from a memory
consumption perspective - as all persistent endpoints are reclaimed on
shutdown - it does cause Stasis endpoint related operations to continue
to believe that the endpoint may or may not exist.
This patch causes the persistent endpoint related to a PJSIP endpoint to be
destroyed if the PJSIP endpoint is deleted.
Change-Id: I85ac707b4d5e6aad882ac275b0c2e2154affa5bb
During a stress test of subscriptions, a huge blast of
subscription-related traffic resulted in the threadpool expanding to a
ridiculous number of threads. The balooning of threads resulted in an
increase of memory, which led to a crash due to being out of memory.
An easy fix for the particular test was to limit the size of the
threadpool, thus reining in the amount of memory that would be used. It
was decided that there really is no downside to having a non-infinite
default value for the maximum size of the threadpool, so this change
introduces 50 threads as the maximum threadpool size for the SIP
threadpool.
ASTERISK-25513 #close
Reported by John Bigelow
Change-Id: If0b9514f1d9b172540ce1a6e2f2ffa1f2b6119be
When an AoR is created or destroyed dynamically, the scheduled OPTIONS
requests that qualify the contacts on the AoR are not necessarily started
or destroyed, particularly for persistent contacts created for that AoR.
This patch adds create/update/delete sorcery observers for an AoR, which
schedule/unschedule the qualifies as expected.
Change-Id: Ic287ed2e2952a7808ee068776fe966f9554bdf7d
When compiled with assertions enabled one will occur when destroying
the subscription tree when UAS dialog creation fails. This is because
the code assumes that a dialog will always exist on a subscription
tree when in reality during this specific scenario it won't.
This change makes it so a dialog is not removed from the subscription
tree if it is not present.
ASTERISK-25505 #close
Change-Id: Id5c182b055aacc5e66c80546c64804ce19218dee
Add the ability to filter output from pjsip list and show commands
using the "like" predicate like chan_sip.
For endpoints, aors, auths, registrations, identifyies and transports,
the modification was a simple change of an ast_sorcery_retrieve_by_fields
call to ast_sorcery_retrieve_by_regex. For channels and contacts a
little more work had to be done because neither of those objects are
true sorcery objects. That was just removing the non-matching object
from the final container. Of course, a little extra plumbing in the
common pjsip_cli code was needed to parse the "like" and pass the regex
to the get_container callbacks.
Some of the get_container code in res_pjsip_endpoint_identifier was also
refactored for simplicity.
ASTERISK-25477 #close
Reported by: Bryant Zimmerman
Tested by: George Joseph
Change-Id: I646d9326b778aac26bb3e2bcd7fa1346d24434f1
During outbound registration it is possible to receive a fatal (any permanent/
non-temporary 4xx, 5xx, 6xx) response from the registrar that is simply due
to a problem with the registrar itself. Upon receiving the failure response
Asterisk terminates outbound registration for the given endpoint.
This patch adds an option, 'fatal_retry_interval', that when set continues
outbound registration at the given interval up to 'max_retries' upon receiving
a fatal response.
ASTERISK-25485 #close
Change-Id: Ibc2c7b47164ac89cc803433c0bbe7063bfa143a2
A certain situation can result in our attempting to send a NOTIFY on a
destroyed dialog. Say we attempt to send a NOTIFY to a subscriber, but
that subscriber has dropped off the network. We end up retransmitting
that NOTIFY until the appropriate SIP timer says to destroy the NOTIFY
transaction. When the pjsip evsub code is told that the transaction has
been terminated, it responds in kind by alerting us that the
subscription has been terminated, destroying the subscription, and then
removing its reference to the dialog, thus destroying the dialog.
The problem is that when we get told that the subscription is being
terminated, we detect that we have not sent a terminating NOTIFY
request, so we queue up such a NOTIFY to be sent out. By the time that
queued NOTIFY gets sent, the dialog has been destroyed, so attempting to
send that NOTIFY can result in a crash.
The fix being introduced here is actually a reintroduction of something
the pubsub code used to employ. We hold a reference to the dialog and
wait to decrement our reference to the dialog until our subscription
tree object is destroyed. This way, we can send messages on the dialog
even if the PJSIP evsub code wants to terminate earlier than we would
like.
In doing this, some NULL checks for subscription tree dialogs have been
removed since NULL dialogs are no longer actually possible.
Change-Id: I013f43cddd9408bb2a31b77f5db87a7972bfe1e5
When sending a NOTIFY, we lock the dialog and then unlock the dialog
when finished. A recent change made it so that the subscription tree's
dialog pointer will be set NULL when sending the final NOTIFY request
out. This means that when we attempt to unlock the dialog, we pass a
NULL pointer to pjsip_dlg_dec_lock(). The result is that the dialog
remains locked after we think we have unlocked it. When a response to
the NOTIFY arrives, the monitor thread attempts to lock the dialog, but
it cannot because we never released the dialog lock. This results in
Asterisk being unable to process incoming SIP traffic any longer.
The fix in this patch is to use a local pointer to save off the pointer
value of the subscription tree's dialog when locking and unlocking the
dialog. This way, if the subscription tree's dialog pointer is NULLed
out, the local pointer will still have point to the proper place and the
dialog lock will be unlocked as we expect.
Change-Id: I7ddb3eaed7276cceb9a65daca701c3d5e728e63a
The SIP dialog is removed from the subscription tree when the final
NOTIFY is sent. However, after the final NOTIFY is sent, the persistence
update function still attempts to access the cseq from the dialog,
resulting in a crash.
This fix removes the subscription persistence at the same time that the
dialog is removed from the subscription tree. This way, there is no
attempt to update persistence when the subscription is being destroyed.
Change-Id: Ibb46977a6cef9c51dc95f40f43446e3d11eed5bb
There have been crashes seen where a taskprocessor's listener is NULL
unexpectedly.
Looking at backtraces, the problem was specifically seen in PJSIP
serializers.
Subscriptions make the mistake of removing a serializer from a dialog
during subscription tree destruction. Since subscription trees are
reference-counted, guaranteeing the circumstances behind the destruction
are not possible. This makes it so that the dialog serializer can be
removed while not holding the dialog lock. This makes it possible for
the distributor to get a pointer to the dialog serializer and have that
serializer get freed out from under it.
The fix for this is to remove the serializer from a subscription dialog
when sending the final NOTIFY. This guarantees that the serializer is
removed with the dialog lock held. By doing this, we guarantee that if
the distributor gains access to the dialog's serializer, it will not be
possible for the serializer to get freed by another thread.
Change-Id: I21f5dac33529f65cec45679bdace60670800ff66
If an old persistent subscription is recreated but then immediately
destroyed because it is out of date, the subscription tree will have no
leaf subscriptions on it. This was resulting in a crash when attempting
to destroy the subscription tree.
A simple NULL check fixes this problem.
Change-Id: I85570b9e2bcc7260a3fe0ad85904b2a9bf36d2ac
There have been crashes and general instability seen in the pubsub code,
so this patch introduces three changes to increase the stability.
First, the ownership model for subscriptions has been modified. Due to
RLS, subscriptions are stored in memory as a tree structure. Prior to my
patch, the PJSIP subscription was the owner of the subscription tree.
When the PJSIP subscription told us that it was terminating, we started
destroying the subscription tree along with all of the individual leaf
subscriptions that belong to the tree. The problem with this model is
that the two actors in play here, the PJSIP subscription and the
individual leaf subscriptions, need to have joint ownership of the
subscription tree. So now, the PJSIP subscription and the individual
leaf subscriptions each have a reference to the subscription tree. This
way, we will not actually free memory until no players are left that
care. The PJSIP subscription is a bigger stakeholder, in that if the
PJSIP subscription's reference to the subscription tree is removed, the
subscription tree instructs the leaf subscriptions to shut down and drop
their references to the subscription tree when possible. The individual
leaf subscriptions, upon being told to shut down, can drop their stasis
subscriptions or whatever they use to learn of new state, and then drop
their reference to the subscription tree once they are ready to die.
Second, the lifetime of a PJSIP subscription's reference to our
subscription tree has been altered. As I learned from doing a deep dive,
the PJSIP evsub code can tell Asterisk multiple times that the
subscription has been terminated, and not all of these times
are especially helpful. I have altered the message flow that we use for
SIP subscriptions such that we will always drop the PJSIP subscription's
reference to the subscription tree when we send the NOTIFY that
terminates a SIP subscription. This also means that we will now queue
NOTIFY requests to be sent after responding to incoming SUBSCRIBEs so
that we can have predictable state changes from the PJSIP evsub code.
Third, the synchronization of operations has been improved. PJSIP can
call into our code from a serializer thread (e.g. upon receiving an
incoming request) or from the monitor thread (e.g. when a subscription
times out). Because of this, there is the possibility of competing
threads stepping on each other. PJSIP attempts to do some
synchronization on its own by always keeping the dialog lock held when
it calls into us. However, since we end up pushing tasks into the
serializer, the result was that serialized operations were not grabbing
the dialog lock and could, as a result, step on something that was being
attempted by a different thread. Now we ensure that serialized
operations grab the dialog lock, then check for extenuating
circumstances, then proceed with their operation if they can.
Change-Id: Iff2990c40178dad9cc5f6a5c7f76932ec644b2e5
In a realtime based system with a limited number of threadpool threads
it is possible for a deadlock to occur. This happens when permanent
endpoint state is updated, which will cause database queries to be done.
These queries may result in URI validation being done which is done
synchronously using a PJSIP thread. If all PJSIP threads are in use
processing traffic they themselves may be blocked waiting to get the
permanent endpoint container lock when identifying an endpoint.
This change moves URI validation to occur at use time instead of
configuration time. While this comes at a cost of not seeing a problem
until you use it it does solve the underlying deadlock problem.
ASTERISK-25486 #close
Change-Id: I2d7d167af987d23b3e8199e4a68f3359eba4c76a
On v13, loading several thousand PJSIP endpoints on Asterisk start causes
a deadlock most of the time.
Thanks to mdu113 for discovering that there was a call to pgsql_exec() not
protected by the pgsql_lock reentrancy lock.
{quote}
I believe a code path exists that attempts to use pgsql connection without
locking pgsql_lock. I believe what happens during that deadlock that I
see is two concurrent threads are both attempting to send query to pgsql,
one of the thread is using a code path without locking pgsql_lock. If
they managed to send queries at the same time, it seems postgres ignores
one of the queries and replies only to the one of them. If it happens so
that the thread holding the lock didn't receive the reply it will wait for
it (and hold the lock) forever (or at least for very long time), thus
completely blocking all access to db.
{quote}
* Added missing reentrancy locking around pgsql_exec() in find_table().
* Moved unlock of pgsql_lock in unload_module() to avoid locking inversion
between the psql_tables list lock and the pgsql_lock.
ASTERISK-25455 #close
Reported by: mdu113
Patches:
res_config_pgsql.c-connlock2.diff (license #5543) patch uploaded by mdu113
Change-Id: Id9e7cdf8a3b65ff19964b0cf942ace567938c4e2
The struct send_request_wrapper has a pjsip lock associated with it that
is created non-recursive. There is a code path for the struct
send_request_wrapper lock that will attempt to lock it recursively. The
reporter's deadlock showed that the thread calling endpt_send_request()
deadlocked itself right after the wrapper object got created.
Out-of-dialog requests such as MESSAGE, qualify OPTIONS, and unsolicited
MWI NOTIFY messages can hit this deadlock.
* Replaced the struct send_request_wrapper pjsip lock with the mutex lock
that can come with an ao2 object since all of Asterisk's mutexes are
recursive. Benefits include removal of code maintaining the pjsip
non-recursive lock since ao2 objects already know how to maintain their
own lock and the lock will show up in the CLI "core show locks" output.
ASTERISK-25435 #close
Reported by: Dmitriy Serov
Change-Id: I458e131dd1b9816f9e963f796c54136e9e84322d
In ast_rtp_read, the value of the variable 'mark' which we try to assign to a
frame->subclass.frame_ending may be 0, 1 or (1<<23), but we should translate
it to 0 or 1.
ASTERISK-25451 #close
Change-Id: I53bdf5c026041730184a6a809009c028549ce626
When we decide we will no longer schedule an RTCP write, we remove the
reference to the RTP instance, then assign -1 to the stored scheduler ID
in case something else comes along and wants to see if anything is scheduled.
That scheduler ID is on the RTP instance. After 60a9172d7e was merged to
fix the regression introduced by 3cf0f29310, this improper assignment on a
potentially destroyed object started getting tripped on the build agents.
Frankly, this should have been crashing a lot more often earlier. I can only
assume that the timing was changed just enough by both changes to start
actually hitting this problem.
As it is, simply moving the assignment prior to the ao2 deference is sufficient
to keep the RTP instance from being referenced when it is very, truly,
aboslutely dead.
(Note that it is still good practice to assign -1 to the scheduler ID when we
know we won't be scheduling it again, as the ao2 deref *may* not always destroy
the ao2 object.)
ASTERISK-25449
Change-Id: Ie6d3cb4adc7b1a6c078b1c38c19fc84cf787cda7
Apparently some endpoints attempt to send a reINVITE before completing the
initial INVITE transaction. In this case PJSIP responds appropriately to
the reINVITE with a 491 INVITE request pending. Unfortunately chan_pjsip
is using the initial INVITE transaction state to determine if an INVITE is
the initial INVITE or a reINVITE. Since the initial INVITE transaction
has not been confirmed yet chan_pjsip thinks the reINVITE is an initial
INVITE and starts another PBX thread on the channel. The extra PBX thread
ensures that hilarity ensues.
* Fix checks for a reINVITE on incoming requests to look for the presence
of a to-tag instead of the initial INVITE transaction state.
* Made caller_id_incoming_request() determine what to do if there is a
channel on the session or not. After a channel is created it is too late
to just store the new party id on the session because the session's party
id has already been copied to the channel's caller id.
ASTERISK-25404 #close
Reported by: Chet Stevens
Change-Id: Ie78201c304a2b13226f3a4ce59908beecc2c68be
When 5c713fdf18 was merged, it allowed for scheduled items to have an ID of
'0' returned. While this was valid per the documentation for the API, it was
apparently never returned previously. As a result, several users of the
scheduler API viewed the result as being invalid, causing them to reschedule
already scheduled items or otherwise fail in interesting ways.
This patch corrects the users such that they view '0' as valid, and a returned
ID of -1 as being invalid.
Note that the failing HEP RTCP tests now pass with this patch. These tests
failed due to a duplicate scheduling of the RTCP transmissions.
ASTERISK-25449 #close
Change-Id: I019a9aa8b6997584f66876331675981ac9e07e39
A deadlock can happen when a sorcery object is being expired from the
memory cache when at the same time another object is being placed into the
memory cache. There are a couple other variations on this theme that
could cause the deadlock. Basically if an object is being expired from
the sorcery memory cache at the same time as another thread tries to
update the next object expiration timer the deadlock can happen.
* Add a deadlock avoidance loop in expire_objects_from_cache() to check if
someone is trying to remove the scheduler callback from the scheduler.
ASTERISK-25441 #close
Change-Id: Iec7b0bdb81a72b39477727b1535b2539ad0cf4dc
Make sorcery_memory_cache_close() call remove_all_from_cache() instead of
partially inlining it.
ASTERISK-25441
Change-Id: I1aa6cb425b1a4307096f3f914d17af8ec179a74c
Basically you should shutdown in the opposite order of how you setup since
later setup pieces likely depend on earlier setup pieces. e.g.,
Registering your external API with the rest of the system should be the
last thing setup and the first thing unregistered during shutdown.
Change-Id: I5715765b723100c8d3c2642e9e72cc7ad5ad115e
In practice the set_role API callback can be invoked even
when no ICE is present on an RTP instance. This can occur
if ICE has not been enabled on it.
ASTERISK-25438 #close
Change-Id: I0e17e4316f0f0d7f095c78c3d4fd73a913b6ba69
* Now conf_alloc() has more off nominal error checking.
* Eliminated RAII_VAR() use in conf_alloc().
* Eliminated a dubius shortcut when destroying cfg->general in
conf_destructor() that would cause a crash if cfg->general failed to get
allocated.
* Add some ACO registration section comments.
Change-Id: Ia40c2b1b2d0777d641605118ae019c5a73865e1a
Need to finish initializing the string fields in the ao2 object before
putting any default strings into them.
ASTERISK-25383 #close
Reported by: yaron nahum
Change-Id: I9f7f3a03f0c4991a01593abf8697b9a587c0ea84
When b99a705262 was merged, subscribing to a
NULL bridge will now cause app_subscribe_bridge to implicitly subscribe to
all bridges. Unfortunately, the res_stasis control loop did not check that
a bridge changing on a channel's control object was actually also non-NULL.
As a result, app_subscribe_bridge will be called with a NULL bridge when a
channel leaves a bridge. This causes a new subscription to be made to the
bridge. If an application has also subscribed to the bridge, the application
will now have two subscriptions:
(1) The explicit one created by the app
(2) The implicit one accidentally created by the control structure
As a result, the 'BridgeDestroyed' event can be sent multiple times. This
patch corrects the control loop such that it only subscribes an application
to a new bridge if the bridge pointer is non-NULL.
ASTERISK-24870
Change-Id: I3510e55f6bc36517c10597ead857b964463c9f4f
This patch adds support for receiving events regarding Peer status changes
and Contact status changes. This is particularly useful in scenarios where
we are subscribed to all endpoints and channels, where we often want to know
more about the state of channel technology specific items than a single
endpoint's state.
ASTERISK-24870
Change-Id: I6137459cdc25ce27efc134ad58abf065653da4e9
This patch adds support for subscribing to all device state changes. This is
done either by subscribing to an empty device, e.g., 'eventSource=deviceState:',
or by the WebSocket connection specifying that it wants all state in the
system.
ASTERISK-24870
Change-Id: I9cfeca1c9e2231bd7ea73e45919111d44d2eda32
This patch adds the ability to subscribe to all events. There are two possible
ways to accomplish this:
(1) On initial WebSocket connection. This patch adds a new query parameter,
'subscribeAll'. If present and True, Asterisk will subscribe the
applications to all ARI events.
(2) Via the applications resource. When subscribing in this manner, an ARI
client should merely specify a blank resource name, i.e., 'channels:'
instead of 'channels:12354'. This will subscribe the application to all
resources of the 'channels' type.
ASTERISK-24870 #close
Change-Id: I4a943b4db24442cf28bc64b24bfd541249790ad6